COGS 108 Final Project


Introduction

There will be a group project associated with this course. It provides an opportunity for students to practice formulating a data science question, consider what data and analysis would be needed to test a hypothesis associated with the question, locate and analyze that real-world data, come to a conclusion, and communicate the results in a scientific report format.

Group formation

The project will start in week 2, but we will begin forming groups immediately since time is short in the summer session. We will provide a team signup sheet where you can tell us who your chosen team members are or ask to be assigned to a team. If you have not found a group by Thursday of week 1, we will assign you to one. If you are assigned to a team and want to change groups after meeting it, do so as early as possible; we will provide a procedure for this (a secondary google form for changing groups).

Group size

Groups should ideally consist of 4-5 students, though we have had single-member groups and groups as large as 7. The extremes pose challenges. With too few students, it is a lot of work for fewer people. With too many, it is difficult for everyone to get enough involvement. If you decide on fewer than 4-5 members, we suggest you plan early and choose an attainable question, dataset and analysis. If you decide on a larger team, you need to be very organized about who is doing which task and ensure that everyone is given a significant enough role. You may need to choose a more complex problem or put more into the analysis.

Project structure

The entire project will be turned in to a group github repository that we will create for you and give you access to. You will practice version control there, keep your data there, and turn in all the pieces there in the form of jupyter notebooks and a final video.

The project will consist of multiple checkpoints to help you break down the task into manageable pieces, culminating in a final report, a video presentation that you record, and a group review form for providing feedback on each other's participation. It might look intimidating, but each element actually helps you draft your final report. By the time you get to the second checkpoint, you will have a rough draft complete and will just need to refine it in the last week of the class. We will also be helping with the last assignment to reduce the time demand, so you can put the most effort into the project.

The pieces are as follows:

Project component | Group/Indiv. | Submission type | Due date
Previous project review | Group | Google form | Friday 7/14 at 11:59pm
Project Proposal | Group | Jupyter notebook to github | Friday 7/14 at 11:59pm
Checkpoint 1: Data | Group | Jupyter notebook to github | Friday 7/21 at 11:59pm
Checkpoint 2: EDA | Group | Jupyter notebook to github | Friday 7/28 at 11:59pm
Final Report | Group | Jupyter notebook to github | Saturday 8/5 at 5pm
Video presentation report summary (5-10min) | Group | github (method of choice) | Saturday 8/5 at 5pm
Group review | Individual | Google form | Friday 8/4 at 11:59pm
Video reviews (depending on timing, either min 3 required or all extra credit, 0.5% each for up to 6% bonus overall) | Individual | Google form | Sunday 8/6 at 11:59pm

 

Previous quarter examples

You can see some examples of previous works here:

https://github.com/COGS108/FinalProjects-Fa21

If you look at any of the group_xx-xxxx repos and look for 'finalproject' notebooks, you will see examples. I will add specific ones later; the link above was chosen so that you can search for what interests you from that particular quarter.

 

Description of project parts

1. Previous project review (Due Friday 7/14/2023 at 11:59pm)

Here you will review 2 previous projects as a team and submit one google form that asks a few questions about each project. The questions are designed to orient you to the project and give you a sense of what you are going to do. They will also get you to think about: 1) What will the final report look like in terms of structure? 2) What are some of the questions you can ask? 3) How will you go about answering those questions?

We recommend that each of you look over the projects and questions individually and make notes, then come together as a group to put together the final submission. However, if it works better for you to do it all as a team, that is up to you.

You will be surprised at what you will be able to formulate and refine. While it helps to see what has been done, don't be afraid to think outside the box. This is what good scientists and engineers do in order to take human knowledge and achievement in new directions!

Requirements:

  • Review 2 of the project options listed in the google form as a group
  • Submit 1 per group, filled out completely
  • The survey will have you list your group team members; please be sure to do that correctly, with names listed as they appear in canvas

Note: You can select a different project from previous COGS108 projects in our github repo or on the standard COGS108 course repo if you would like, but don't spend much time searching given the timing of the assignment.

 

 

2. Project proposal (Due Friday 7/14/2023 at 11:59pm)

In this part of the project you will complete the following, using the format available in the projects github repo for COGS108:

  • Question - Write a draft of your question
  • Hypothesis - Given your question, boil it down to a scientific hypothesis and null hypothesis that will be tested in your study
  • Data ethics and privacy - Provide an overview of how your project will consider data ethics and privacy now and for the future
  • Literature review - Provide a background review of the literature (a draft you will add to as the session progresses)
  • Data - Locate at least some of the data you will probably use, and where you might find more if needed
  • Schedule - Create a rough schedule
  • Tasks - Plan out tasks as specifically as you can, as far ahead as you can predict. If some people in the group are great writers, they can take the lead on the writing, and if some are more experienced in coding or data science, they can take the lead on that. In both cases everyone should contribute for experience, perhaps with guidance
  • References - You will need to review the existing literature on your topic and include the most relevant publications as references in APA or MLA format. There will be a literature review section of your final report, and we will discuss how that works.

It is understood that this early on all of this is a draft, and we have a short time to put this together, but the more effort that goes into this part, the more you will cruise through the rest. The idea is to get far enough to have a sense of whether your question makes sense and is answerable with data, or needs refinement.

We will look at your submissions and provide feedback to help you refine the question and your plans for how you will work with and analyze the data (not tear apart what you submit). Asking the right question can mean the difference between a smooth project and a project with many necessary significant changes (such as a totally different dataset).

Given summer session timing, we have not yet done a great deal of review of data science techniques, and we have kept this in mind in planning the project. You will be adding further specifics as you go and some aspects will change, which is why we have a data checkpoint and an exploratory data analysis checkpoint later.

Requirements:

  • Fill out the Project proposal jupyter notebook (ProjectProposal_groupXXX.ipynb) included in the github repository created for your group
  • Submit the notebook to the github repo created for you at drsimpkins-teaching by Friday at 11:59pm

3. Checkpoint 1: Data (Due Friday 7/21/2023 at 11:59pm)

For this checkpoint you should have located one or more datasets associated with your project, accessed them, and performed most or all of your data wrangling and cleaning in order to get the data into a usable form for your analysis. You will write up a short description of the data, where it comes from, what it represents, and how it is structured, and you will include your code, well commented.

You are not necessarily expected to have performed much analysis at this point, though you should have done some basic visualization or otherwise shown that the data is going to be useful for your question/hypothesis. You may decide at this time that you need more data or different data. To complete the checkpoint, you should have a good idea of where you can get more data if you need it. Perhaps you already have it and have not completed the secondary cleaning and wrangling, but have the framework from the first dataset.

The main idea is: get your data, get it wrangled so you can operate on it, and take a quick look at it. Plots are not required, and certainly no significant analysis is. But try to have a sense of whether this is the dataset you will be sticking with. If you need more data, you should have put significant work into getting it, even if it is too late to fully incorporate it into your checkpoint.

Data description requirements: state that the data is structured (or how you can make it structured), what it contains, and any characteristics needed for your data science project

  • Data description including what the data is, how it supports answering your question, motivation as to why this is the dataset you will be using
  • Data wrangling and code showing you have gotten the data into python/jupyter and have it in a usable/accessible form: you have removed missing values and NaNs, it is rectangular, etc.

  • Next steps – a brief statement of what you plan to do next to perform EDA on the data, but you do not have to have actually done this yet
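As a rough illustration of the wrangling step described above, here is a minimal sketch that loads a small table, converts numeric columns, and drops rows with missing values or NaNs so the result is rectangular. The column names, the sample values, and the helper names `to_number` and `load_clean_rows` are all made up for illustration. It uses only the Python standard library; the same steps can be done with whatever data library your group prefers.

```python
import csv
import io
import math

# Hypothetical raw data standing in for a downloaded CSV file.
RAW_CSV = """region,year,score,count
A,2020,18.5,1200
B,2020,,3400
C,2020,24.0,NaN
D,2020,11.0,540
"""

# Text values that should be treated as missing (an assumption; adjust for your data).
MISSING = {"", "na", "nan", "null", "none"}


def to_number(text):
    """Convert a cell to float, returning None when it is missing or not numeric."""
    if text is None or text.strip().lower() in MISSING:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def load_clean_rows(handle, numeric_columns):
    """Read CSV rows, convert numeric columns, and drop rows with missing values."""
    clean, dropped = [], 0
    for row in csv.DictReader(handle):
        converted = dict(row)
        for column in numeric_columns:
            converted[column] = to_number(row.get(column))
        if any(converted[column] is None for column in numeric_columns):
            dropped += 1  # keep a count so the write-up can report what was removed
            continue
        clean.append(converted)
    return clean, dropped


rows, n_dropped = load_clean_rows(io.StringIO(RAW_CSV), ["year", "score", "count"])
print(f"kept {len(rows)} rows, dropped {n_dropped}")
for row in rows:
    print(row)
```

Dropping incomplete rows is only one option. Depending on your question it may be more appropriate to fill in or flag missing values, and either way the description of steps should say what was done and why.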

Requirements:

  • Fill out the project checkpoint notebook (DataCheckpoint_groupXXX.ipynb) included in the github repository on the main page for the course, in the project directory
  • Turn it in to your group github repo by the deadline
  • Data wrangled/cleaned, with commented code and a description of the steps
  • Data description filled in as described above
  • Some descriptive representation of the data (pre-visualization) that demonstrates the data has been input and processed and is ready for use

4. Checkpoint 2: EDA (Due Friday 7/28/2023 at 11:59pm)

For this checkpoint, now that you have your data, the goal is to work towards rejecting or failing to reject the null hypothesis and gaining insight into your data science question. Start by exploring the basic statistics of your data: central tendency and variability. If the data has a normal distribution you can use standard (parametric) statistics; otherwise you can explore it with nonparametric statistics. Then, given those insights, create visualizations, tables, or other ways to 'look at' the data. Finally, as you probably have an idea of what type of modeling you would like to do, execute at least a good portion of the modeling (regression, curve fits); if you are doing machine learning, try to get as far as possible on this. It is understood you may be expanding this for the final report, which is why we call it 'exploratory' instead of 'super final done from every angle' or similar. Keep in mind that the more you explore your data, the better the picture you will have in your head of the final statements you can make about it.

Overall, consider this a first rough draft of the report, potentially with missing sections; for example, the conclusion and results sections might not be fully written. You want to explore as far as possible and have a good idea of the last bits of modeling you need to do in order to make and support a statement regarding your question and hypothesis.
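As one possible starting point for the central tendency and variability step, the sketch below summarizes two made-up groups with the standard library `statistics` module. The data values and the `describe` helper are hypothetical, and the mean-median gap is only a rough heuristic for asymmetry, not a formal normality test.

```python
import statistics

# Hypothetical measurements for two groups; replace with columns from your cleaned data.
group_a = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 12.5]
group_b = [14.2, 15.1, 13.9, 30.5, 14.8, 15.0, 14.4]  # includes one extreme value


def describe(values):
    """Summarize central tendency and variability for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)
    return {
        "n": len(values),
        "mean": mean,
        "median": median,
        "stdev": stdev,
        "iqr": q3 - q1,
        # Rough symmetry indicator: close to 0 when mean and median agree.
        "mean_median_gap": (mean - median) / stdev if stdev else 0.0,
    }


summary_a = describe(group_a)
summary_b = describe(group_b)
for label, summary in (("group_a", summary_a), ("group_b", summary_b)):
    print(label, {key: round(value, 3) for key, value in summary.items()})
```

In this made-up data the extreme value in `group_b` pulls the mean well above the median, which is the kind of signal that would suggest looking at the distribution more closely and considering nonparametric methods.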

 

Requirements:

  • Fill out the project checkpoint notebook (EDACheckpoint_groupXXX.ipynb) included in the github repository on the main page for the course, in the project directory
  • Turn it in to your group github repo by the deadline
  • Data should be explored through basic descriptive statistics, visualizations, hypothesis testing (inferential analysis)
  • You should have completed basic modeling of the data. The above analysis will give insight as to where to take the data, and it's possible that it might take you in an unplanned direction that is not complete by the checkpoint. That's ok, but you should do as much modeling as is reasonable given the timing.
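For the hypothesis testing item above, one nonparametric option is a permutation test on the difference in group means. The sketch below is a minimal illustration under assumptions: the two samples are made up, the function name `permutation_test` and its parameters are hypothetical, and the p-value is an approximation based on random shuffles rather than an exact calculation.

```python
import random
import statistics


def permutation_test(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = statistics.mean(sample_a) - statistics.mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical samples; replace with the two groups from your own data.
control = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 12.5]
treatment = [14.2, 15.1, 13.9, 15.6, 14.8, 15.0, 14.4]

observed_diff, p_value = permutation_test(control, treatment)
print(f"observed difference in means: {observed_diff:.3f}")
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value would lead you to reject the null hypothesis of no difference between the groups; otherwise you fail to reject it. Whether this test suits your project depends on your data and question.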

 

5. Final report and video: Report due Friday 8/4/2023 at 11:59pm

For now, please reference the final project template. Essentially you will go from the EDA to the final modeling (if modeling), write your results and conclusion sections, and refine the EDA checkpoint version of your draft. Correct spelling and grammar, clean up the code, and include comments as well. The conclusion is where you consider the scientific hourglass shape: a broad initial starting point that narrows down to a hypothesis, then breadth that increases again until the conclusion, where you make your statements about the implications and future work.

Report Requirements:

  • Fill out the FinalReport template (FinalProject_groupXXX.ipynb - raw file) included in the github repository on the main page of the course, in the project directory
  • Turn it in to your group github repo by the deadline
  • You should have taken the EDA material and expanded through your final analysis and modeling, drawn a conclusion and refined your earlier checkpoints to include all the sections in the report template
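If your final modeling involves a simple regression, the idea can be sketched as an ordinary least-squares line fit. This is only an illustration: the data is made up, `fit_line` is a hypothetical helper written out by hand so the arithmetic is visible, and your project may call for a different model entirely.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 compares residual variation to total variation around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Hypothetical predictor and outcome values; replace with your own variables.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 71]

slope, intercept, r_squared = fit_line(hours, score)
print(f"score = {slope:.2f} * hours + {intercept:.2f} (R^2 = {r_squared:.3f})")
```

A fit like this is a starting point for the results section, not a conclusion in itself. The report should still discuss whether a linear model is reasonable for the data and what the fit does and does not show.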

Final report video (due by Sat at 12 noon, so we can create a list for people to watch and do video reviews for EC)

Video instructions: record a 5-8 minute video introducing your project, data, results and conclusion. A set of slides is encouraged but not a hard requirement; you can present in whatever way is most effective for your project. It should be planned out, however, so outline it first.

Requirements:

  • Length ~5-8 min, must be less than 10 min
  • Cover the project, question, data, methods, results, and conclusion briefly
  • Slides are useful but not required. However, try to avoid jumping around your report: plan and outline the video so it is repeatable, and practice a few times
  • Share it as a file, as a link, in your repo, or on youtube; it's up to you, just make sure we can access it
  • Everyone needs to have involvement, but everyone does NOT need to speak. We do not require video of your face, but we should see your slides/images/report or whatever form you are presenting
  • You have leeway to be creative, and time is limited, but do your best to present it as professionally as possible

Group peer review form here

Ideas for open datasets:

Here you will find several links to open datasets from a variety of sources. You can also locate other open datasets yourself, or use your own datasets if they are stripped of personally identifiable information (PII) and you have the rights to use them.
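If you bring your own dataset, one way to start stripping PII is to drop direct identifier columns and replace record IDs with pseudonyms. The sketch below is a minimal illustration under assumptions: the column names, the sample records, and the helpers `pseudonymize` and `strip_pii` are all hypothetical. Note that removing direct identifiers does not by itself guarantee anonymity, since combinations of the remaining columns can sometimes still identify a person.

```python
import hashlib

# Hypothetical column names; adjust to whatever your dataset actually contains.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest, truncated for readability."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def strip_pii(records, id_column="record_id", salt="replace-with-a-secret-salt"):
    """Return copies of the records without direct identifiers and with hashed IDs."""
    cleaned = []
    for record in records:
        kept = {key: value for key, value in record.items()
                if key not in DIRECT_IDENTIFIERS}
        if id_column in kept:
            kept[id_column] = pseudonymize(kept[id_column], salt)
        cleaned.append(kept)
    return cleaned


# Made-up records for illustration only.
records = [
    {"record_id": "A123", "name": "Example Person", "email": "person@example.com",
     "group": "x", "score": 88},
    {"record_id": "A124", "name": "Another Person", "email": "another@example.com",
     "group": "y", "score": 92},
]

cleaned = strip_pii(records)
for row in cleaned:
    print(row)
```

Keep the salt out of your repo, and keep the original file out of it as well, so that the pseudonyms cannot easily be mapped back to the original IDs.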

More info to come...