Python - Transforms (School absense codes)
- Due Nov 16, 2015 by 9am
- Points 10
- Submitting a file upload
- File Types pdf
- Available after Nov 12, 2015 at 9am
For this assignment you will convert data between two schemas (schemata? whatever). You will have to use skills from the Transforms screencasts, as well as code to read/write CSVs and looping code. Remember: Questions in the discussion forum: Assignment Discussion: Python - Transforms (School Absence Codes)
The data tracks possible absences from school by students. Students can be absent for four reasons: Sick, Religious Holiday, Parental Request or Unexplained. The schemas include the date of the absence.
The first schema (which you will receive data in) has these columns:
student_id, year, month, absence_code_string
The absence code string is exactly 31 characters long, one for each possible day in a month. These characters take one of 5 forms: "S" for Sick, "R" for Religious Holiday, "P" for Parental Note, "U" for Unexplained, or "." (a dot) for "Present". Consider these rows (spaces added between columns for clarity):
22345341, 2015, April, ...............................
22345343, 2015, April, S..............................
22349590, 2015, April, .......R.......................
So if a student had the string "..............................." (that's 31 dots, trust me) then the student was not absent at all that month. If the student had the string "S.............................." then the student was Sick on the first day of the month. If the student had ".......R......................." then the student was absent for a Religious Holiday on the 8th day of the month. So there's quite a bit of information encoded into the absence_code_string field, which you have to parse out.
A full csv to work with (although you should test with a smaller version made using "head" is here, will copy to your home folder
cp -r /home/jhowison/data_wrangling_class_files/11_transforms/11_3_spliting_out_data/student_absences.csv ~/
The destination schema is a table called "absences" and has these columns:
id, student_id, date, absence_reason
As you can see the table only keeps track of absences, and doesn't record days that the student attended. The absence reason code has full words, rather than letter codes. e.g., Our examples above would translate to these rows:
32, 22345343, 2015-04-01, Sick
33, 22349590, 2015-04-08, Religious Holiday
In the comments for your conversion script answer these questions:
- In what ways is the second schema an improvement?
- Is any information present in the first schema that isn't in the second schema? If so, how might you solve that (might involve other tables)
- What aspects of the conversion were most difficult? What helped you in thinking it through?
Rubric
Criteria | Ratings | Pts | ||
---|---|---|---|---|
Q1
threshold:
pts
|
|
pts
--
|
||
Q2
threshold:
pts
|
|
pts
--
|
||
Q3
threshold:
pts
|
|
pts
--
|
||
Overall Logic
threshold:
pts
|
|
pts
--
|
||
Converting letter to string
threshold:
pts
|
|
pts
--
|
||
Getting the date
threshold:
pts
|
|
pts
--
|
||
Zero padding the dates
threshold:
pts
|
|
pts
--
|
||
Total Points:
10
out of 10
|