Tidy Data


1. Tables: Best Practices

one table per sheet
as small as possible (“minimal redundancy”)
columns: variables
rows: observations, articles, …
“lookup” columns = index = unique identifier for each row
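The last point can be checked mechanically. Here is a minimal sketch in plain Python, assuming a small made-up article table; the table contents and the helper names `has_unique_index` and `build_index` are hypothetical and only serve as illustration.

```python
# Hypothetical example table: one row per article, one column per variable.
articles = [
    {"article_id": 101, "name": "notebook", "price_eur": 2.50},
    {"article_id": 102, "name": "pencil", "price_eur": 0.80},
    {"article_id": 103, "name": "eraser", "price_eur": 0.60},
]


def has_unique_index(rows, key):
    """Return True if every row has a different value in the lookup column."""
    values = [row[key] for row in rows]
    return len(values) == len(set(values))


def build_index(rows, key):
    """Map each lookup value to its row; refuse columns with duplicates."""
    if not has_unique_index(rows, key):
        raise ValueError(f"column {key!r} does not identify rows uniquely")
    return {row[key]: row for row in rows}


index = build_index(articles, "article_id")
print(index[102]["name"])
```

A column such as `name` could also serve as lookup column, but only as long as no two articles share a name; a dedicated number is usually the safer choice.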

2. Excel Basics

Note: we did not get as far as “cleaning up data” (see below). Instead, we took a step back and are continuing with Excel basics.

I am leaving the information below in place for anyone interested; we will come back to it later.

3. Cleaning Up Data

You can find a data set online:

smartschool >> vakken >> 5BWE Informaticawetenschappen >> documenten >> “data klaslijst”
or https://mielke.ws/dat/data_klasoverzicht.csv

klas_nummer	1	2	3
naam	Benito Yoshino	Elsie Baker	Vera Ladner
geslacht	m	v	v
jarig	21/10/2010	10/03/2010	22/11/2010
hoofdstuk 1	0.4	0.75	0.59
hoofdstuk 2	0.51	0.67	0.58
hoofdstuk 3	0.39	0.77	0.52
dagelijks werk	0.51	0.7	0.66
examen	0.59	0.79	0.65

Goal 1: apply the principles above!
Goal 2: do girls score better than boys in this subject?

STEP 1: (class) What do we need to improve?
STEP 2: (groups of 2) Plan: how are we going to do this?
STEP 3: (everyone) carry it out
STEP 4: (class) analysis
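For those who want to compare their spreadsheet result with a script: the sketch below is one possible way to do both goals in plain Python, not the solution worked out in class. It assumes the tab-separated layout of the three students shown above, turns the table so that each row is one student, and then averages the scores per value of `geslacht`. The variable names are my own choice.

```python
import csv
import io
from statistics import mean

# The three students shown above, in the original layout:
# one student per column, one variable per row (tab-separated).
raw = (
    "klas_nummer\t1\t2\t3\n"
    "naam\tBenito Yoshino\tElsie Baker\tVera Ladner\n"
    "geslacht\tm\tv\tv\n"
    "jarig\t21/10/2010\t10/03/2010\t22/11/2010\n"
    "hoofdstuk 1\t0.4\t0.75\t0.59\n"
    "hoofdstuk 2\t0.51\t0.67\t0.58\n"
    "hoofdstuk 3\t0.39\t0.77\t0.52\n"
    "dagelijks werk\t0.51\t0.7\t0.66\n"
    "examen\t0.59\t0.79\t0.65\n"
)

wide = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# Transpose: afterwards every row is one student, every column one variable.
header, *records = zip(*wide)
students = [dict(zip(header, record)) for record in records]

score_columns = ["hoofdstuk 1", "hoofdstuk 2", "hoofdstuk 3",
                 "dagelijks werk", "examen"]

# Average score per student, collected per value of "geslacht".
averages_by_sex = {}
for student in students:
    student_average = mean(float(student[col]) for col in score_columns)
    averages_by_sex.setdefault(student["geslacht"], []).append(student_average)

for sex, values in sorted(averages_by_sex.items()):
    print(sex, "n =", len(values), "mean =", round(mean(values), 3))
```

Three students are of course far too few to answer Goal 2; the same steps would have to be run on the full data set.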

4. Useful Functions

VERT.ZOEKEN (English: VLOOKUP)
HORIZ.ZOEKEN (English: HLOOKUP)
VERSCHUIVING (English: OFFSET)
INDIRECT
IFERROR (Dutch: ALS.FOUT)
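To make clear what a vertical lookup does, here is a rough Python analogue of an exact-match VERT.ZOEKEN wrapped in IFERROR. It is only a sketch of the idea, not how the spreadsheet implements it; the function name `vlookup` and its parameters are hypothetical, and the three rows are taken from the class list above.

```python
# Lookup table: the first cell of each row is the unique key (klas_nummer),
# followed by naam and the examen score from the class list above.
table = [
    (1, "Benito Yoshino", 0.59),
    (2, "Elsie Baker", 0.79),
    (3, "Vera Ladner", 0.65),
]


def vlookup(key, rows, column, default=None):
    """Return cell `column` (1-based) of the first row whose first cell is `key`.

    If no row matches, return `default`; this plays the role of the
    fallback value that IFERROR supplies in a spreadsheet.
    """
    for row in rows:
        if row[0] == key:
            return row[column - 1]
    return default


print(vlookup(2, table, 2))
print(vlookup(9, table, 2, default="unknown"))
```

The lookup only gives a well-defined answer because the first column is a unique index, which is the point of the “lookup” column in the best practices.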

5. Further Reading

https://tidyr.tidyverse.org/articles/tidy-data.html

6. Data Generation

Here you can see how I generated the data.

import numpy as NP # numerics
import numpy.random as RND # random numbers
import pandas as PD # data tables in python
import names as NAME # random names
import datetime as DT # date and time

RND.seed(42) # fix the seed so that the random numbers come out the same on every run

n_students = 24 # the number of students in the hypothetical class


averages = 1. - RND.beta(2., 5., n_students) # values between zero and one often follow a beta distribution: https://en.wikipedia.org/wiki/Beta_distribution
sex_is_male = RND.choice([0., 1.], n_students) # whether a student is male (1) or female (0)

names = [NAME.get_full_name(gender = 'male' if male else 'female') for male in sex_is_male] # choose random names
# (Sorry, these all seem to be American/British names.)

# get random birth dates, some time in 2006 or 2007
start_year = DT.datetime.strptime('1/1/2006', '%d/%m/%Y')
birth_day = [start_year + DT.timedelta(days = int(day)) for day in RND.uniform(0, 365*2, n_students).astype(int)] # random offsets of up to two years after the start date
birth_day = [bd.strftime('%d/%m/%Y') for bd in birth_day]


def GetTestScores():
    # this function will generate random test scores.

    scores_base = averages + RND.normal(0., 0.1, n_students) # roll the dice
    scores_base -= 0.05 * sex_is_male
    failure = RND.uniform(0., 1., n_students) < 0.05 # sometimes, a student will fail!
    prepared = RND.uniform(0., 1., n_students) < 0.05 # sometimes, a student will be prepared!

    scores = scores_base - failure * 0.1 + prepared * 0.1 # add in failures and preparation

    # check limits: scores are between 0% and 100%
    scores[scores < 0.] = 0.
    scores[scores > 1.] = 1.

    return NP.round(scores, 2) # return rounded test results.


# combine the data
data = PD.DataFrame.from_dict({
    'naam': names,
    'geslacht': [('m' if male else 'v') for male in sex_is_male],
    'jarig': birth_day,
})
data.index += 1
data.index.name = 'klas_nummer'

# calculate test scores
for test in ['hoofdstuk 1', 'hoofdstuk 2', 'hoofdstuk 3', 'dagelijks werk', 'examen']:
    data[test] = GetTestScores()


# save the data on the hard drive.
data.to_csv('data_klasoverzicht.csv', sep = ';', decimal = ',')
print(data.head(3).T) # show the first three students, transposed
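Note that the file is written with `;` as separator and `,` as decimal mark, so whoever reads it back has to take both into account. The sketch below shows this with Python's standard library only; the sample text is a shortened, assumed version of the file (one student, one score column), and `parse_decimal_comma` is a hypothetical helper.

```python
import csv
import io

# Shortened stand-in for the saved file: ';' separates the columns
# and ',' is the decimal mark (assumed layout, one student only).
sample = (
    "klas_nummer;naam;geslacht;jarig;examen\n"
    "1;Benito Yoshino;m;21/10/2010;0,59\n"
)


def parse_decimal_comma(text):
    """Convert a number written with a decimal comma, e.g. '0,59', to float."""
    return float(text.replace(",", "."))


rows = []
for row in csv.DictReader(io.StringIO(sample), delimiter=";"):
    row["examen"] = parse_decimal_comma(row["examen"])
    rows.append(row)

print(rows[0]["naam"], rows[0]["examen"])
```

With pandas, the same can be done in one call: `PD.read_csv('data_klasoverzicht.csv', sep = ';', decimal = ',', index_col = 'klas_nummer')`.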
