Tools of the Trade: Lesson Plan


6 Hours


Contents


Section
Overview
Session Materials
Prerequisites
Learning Objectives
Assignment
Technical Knowledge
Skills, Attitudes and Behaviours
KM1 Data Analysis Tools Syllabus
Recordings
Session Outline
Additional Resources
Coach Notes

Overview


In this module we will be covering various tools apprentices will need to enhance their analytics performance in their roles. In the first half of the session we will give an overview of Big Data, considering how it works and the benefits and drawbacks of using it. This is followed by a discussion of data analytics platforms and how using them compares with coding a solution yourself.


In the second part of the module we will be looking into the statistical programming language R, briefly looking at how many of the processes we learned in Python can be performed in R.



Materials


Session 1 Slide Deck

Session 2 Data


Prerequisites




Learning Objectives




Assignment


Part 1


Task


Justify the use (or non-use) of Big Data technologies in your role


Things to Consider



Part 2


Task


Embed R knowledge and skills by building a model to predict species of iris


Things to Consider


How can you use these skills in your role? Revisit a previous project and see if you can recreate it using these techniques



Technical Knowledge




Skills, Attitudes and Behaviours




KM1 Data Analysis Tools Syllabus




Recordings (coach use only)


Link

Session Outline


Section Approx. Timing
Class Introduction 10 Minutes
Understanding Big Data 20 Minutes
Five V's 25 Minutes
Developing the Future 20 Minutes
How Does It Work? 10 Minutes
Break 10 Minutes
Using Big Data in Your Role 10 Minutes
Advantages and Disadvantages 20 Minutes
Big Data Products 5 Minutes
Data Platforms 20 Minutes
Platforms vs Coding Yourself 20 Minutes
Setting Up R 5 Minutes
Session 1 Recap 5 Minutes
Class Introduction 10 Minutes
Features of R 30 Minutes
Control Flow in R 30 Minutes
EDA in R 35 Minutes
Break 15 Minutes
Visualisation in R 35 Minutes
Linear Modelling in R 15 Minutes
RStudio 5 Minutes
Session 2 Recap 5 Minutes


Additional Resources




Coach Notes


Topic Class Introduction Duration 10 Minutes
Objectives
  • To provide an overview of the class agenda and the expected learning objectives
Notes

The coach welcomes apprentices to the lesson and runs through the session outline and learning objectives. The coach can run an ice breaker from here.


Topic Understanding Big Data Duration 20 Minutes
Objectives
  • Define what we mean by 'Big Data'
Notes

In this section we will be introducing the concept of Big Data and why it is important to both the apprentice and their employer.


Check for apprentice understanding of what Big Data is by asking them to annotate the screen. You can ask some follow up questions like:

  • How would you use it in a working environment?
  • What types of data do we mean by 'Big Data'?
  • Does your organisation have a Big Data strategy?

Show apprentices the quote by McKinsey and ask for their opinion. State that organisations will need to adapt continuously to stay competitive in their market because of the ever-growing amount of data being generated. The challenge is not only storing this data (from social media, phones, imaging, etc.) but also analysing it, especially when it does not arrive in conventional structures.


Lead a discussion around the fact that Big Data is not just about volume, but velocity too. Data is growing exponentially and companies need to keep up. Apart from storage there is no real limit to how big 'Big Data' can be, but to be considered 'Big Data' a dataset needs to be big enough that traditional processing (such as on your own computer) cannot handle it.


Before showing the examples to apprentices, ask them to guess how much data Twitter, the NYSE and Facebook generate every day. You can show the apprentices the links (particularly the Twitter one) to see in real time how the data is growing.


Topic The Five V's Duration 25 Minutes
Objectives
  • Understand the characteristics that make up 'Big Data'
Notes

In this section you will be guiding apprentices through five facets of Big Data that they should consider: volume, velocity, variety, veracity and value. For each one, explain what the term means, and present the five together as a mental checklist that apprentices should use when evaluating their data.


Most of the notes are on the screen for this section, but take time to ask apprentices to annotate the screen or suggest what each term means and how it might look in their business. For example, how voluminous is the data they use regularly? What difference would it make to a company if it considered the value of Big Data? Also get them to consider the challenges Big Data poses against the Five V's, such as how difficult it is to keep up with fast-flowing data, or how agile a dashboard will be if it is using live Big Data.


Give apprentices 10 minutes to complete the activity. In breakout rooms they should discuss the data they use in their everyday role and evaluate it against the Five V's: how voluminous it is, how fast it arrives, how varied it is, its quality and its value to the organisation. They should share these in their room and be prepared to present them to the whole group. For an open cohort, try to make the breakout rooms employer-specific if possible.


Topic Developing the Future Duration 20 Minutes
Objectives
  • Understand the potential Big Data Analysis can have on our industries
Notes

In this section you will be helping apprentices understand the impact Big Data has had on the world and from there consider how it can impact their business. To start them off, ask how utilising Big Data technologies has impacted medical science.


Through the use of Big Data, human genome mapping has provided a detailed understanding of genetic makeup and lineage. The healthcare industry is therefore in a position where it can better predict genetic illnesses and provide interventions much earlier. Even though the amount of data has grown, the cost of sequencing one human genome has fallen. Are apprentices aware of any other businesses that have been able to leverage Big Data?


What about Amazon? Alexa uses a combination of NLP and Big Data to provide a huge range of services. Give apprentices 10 minutes in breakout rooms to discuss and map out how they think Alexa makes recommendations. Once back, guide them through an example of how Alexa takes a command and uses NLP and Big Data to give its response. As an illustration, there is a video of an Alexa being given a command to switch on a light, which demonstrates how quickly it works.


Topic How it Works Duration 10 Minutes
Objectives
  • Understand the principles behind how Big Data technologies work
Notes

In this section you will be guiding apprentices to understand how Big Data technologies can make their analysis quicker. For example, think about how long it takes to process a few thousand rows of data on your computer. How much longer would this take if the rows numbered tens of millions? Big Data technology helps speed up this process.


Explain the concept of divide and conquer in the context of data analysis. Draw particular attention to how processing is split across several servers, which speeds up the work and reduces the risk of a single machine crashing with a fatal error.
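
The divide-and-conquer idea can be sketched in a few lines of R. This is a minimal illustration only: the data (`values`) and the choice of four chunks are made up for the example, and each chunk is processed in turn on one machine, whereas a Big Data framework would send each chunk to a separate server.

```r
# Hypothetical data: the numbers 1 to 1000 stand in for a large dataset
values <- 1:1000

# Divide: assign each position to one of 4 chunks of roughly equal size
chunk_id <- cut(seq_along(values), breaks = 4, labels = FALSE)
chunks <- split(values, chunk_id)

# Conquer: summarise each chunk independently of the others
partial_sums <- sapply(chunks, sum)

# Combine: merge the partial results into the final answer
total <- sum(partial_sums)
print(total)
```

Because no chunk depends on another, the "conquer" step is the part that could run on several servers at once.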


Give apprentices 5 minutes to answer the maths question before discussing the answer. Use this as a platform to discuss how much more quickly their own analysis could be conducted if they used this technology.


Topic Using Big Data in your Role Duration 10 Minutes
Objectives
  • Consider how Big Data technology can be utilised in your role
Notes

In this section apprentices will consider how they could use Big Data in their role, drawing on everything they have heard so far. Give them 5-10 minutes in breakout rooms to bounce ideas off each other and then feed back by annotating the screen.


Topic Advantages and Disadvantages Duration 20 Minutes
Objectives
  • Justify the use of Big Data in analysis by considering the benefits and drawbacks
Notes

In this section you will lead a discussion on the pros and cons of using Big Data. In particular, get apprentices to consider their portfolio and how they can evidence this discussion. Run through the advantages and disadvantages and ask apprentices if they can suggest any others. Also ask how they feel about these and whether they are justified. What are their thoughts?


For the activity give them 10 minutes to read the article and discuss their thoughts. Can they think up other arguments that support or disagree with the writer?


Topic Big Data Products Duration 5 Minutes
Objectives
  • Be aware of the various Big Data technologies that exist
Notes

This is the final section on Big Data. So far apprentices have considered the theoretical implications of the technology: how it works, how they can use it, and its advantages and disadvantages. In this section run through the various technologies that exist, for example AWS (Amazon), Spark and Hadoop (Apache), BigQuery (Google) and Azure (Microsoft). State that Hadoop was one of the first to come to market and many companies still use it. State that each has a wide array of functionality and integrates well with languages such as Python and R and with tools such as Git. Ask if any apprentices have experience of these products and what they think of them.


Topic Data Platforms Duration 20 Minutes
Objectives
  • Consider the various platforms that exist
  • Reflect on and justify the products you use
Notes

In this section you will guide apprentices through a discussion on the various data platforms that exist. Data analytics is a fast-growing industry and many new products that make all aspects of a project easier (data mining, storage, cleaning, analysis, visualising, modelling, etc.) are coming onto the market. Many have been around for a long time and apprentices may already be using them. It is important that apprentices explain in their portfolio what software they used and justify why; this section will help them in that thought process.


Go through each of the types of platform and ask what experience the apprentices have of each. Do any make use of visualisation software? What are the pros and cons? Follow similar lines of enquiry through data management (Excel), statistical analysis (SPSS), cloud-based computing and servers (Azure) and general data analytics (Knime, Alteryx, Anaplan, Jira, Qlik).


Note some advantages and disadvantages for each, such as SPSS being expensive to license but incredibly quick to perform statistical analysis once you know what you are doing. Ask the apprentices to suggest these as well. The next section will discuss these in more detail.


Topic Platform vs Coding Yourself Duration 20 Minutes
Objectives
  • Justify the use of platforms by comparing them to coding
Notes

Apprentices will use many platforms and software packages in their role and it is important they discuss and justify these in their portfolio. A common comparison is with coding: why pay for a license when you can do it yourself? A common example an apprentice might give is that under time pressure it is easier to use something they are familiar with (Excel, for example), but that they would be willing to experiment if they had longer.


Go through each slide and lead a discussion on each. Do they think the points are fair? How do they feel about coding? Would they prefer to do it if they had the opportunity? You can use opinion polls to understand their feelings.


For the activity split the apprentices into small groups and give them 7 minutes to defend a software package or platform (I would suggest Excel, Python and Tableau/Power BI). Afterwards each group nominates one person to give a 1 minute pitch on why their software would be the best to use.


Topic Setting Up R Duration 5 Minutes
Objectives
  • Ensure apprentices have access to R
Notes

Follow the on-screen instructions to set up an R environment on Jupyter Notebook. Although GUIs like RStudio are more common, we are using Jupyter because apprentices are already familiar with it from the data science immersives, so we don't need to spend time explaining a new system.


Topic Recap Duration 5 Minutes
Objectives
  • To recap the day
Notes

Go through the learning objectives again, show apprentices the first part of the assignment and check their understanding. Remind them to complete the session attendance log and update their OTJT.


Topic Class Introduction Duration 10 Minutes
Objectives
  • To provide an overview of the class agenda and the expected learning objectives
Notes

The coach welcomes apprentices to the lesson and runs through the session outline and learning objectives. The coach can run an ice breaker from here.


Topic Features of R Duration 30 Minutes
Objectives
  • Understand the basic features of R
Notes

R is a statistical programming language created by Ross Ihaka and Robert Gentleman at the University of Auckland and released in 1995. Explain to apprentices that it is a language built for statistical analysis and is better optimised for tasks like hypothesis testing than Python is. Python and R have many similarities, so use this as an introduction to the R syntax (and particularly note the differences). Throughout the session you will be showing different ways of achieving what apprentices could already do in Python, so keep stressing that R is more suited to statistical analysis and that many analysts do in fact prefer using it.


For this session you will be using Jupyter Notebook as apprentices will be familiar with it from the DS intensives.


Moving through this section you will be guiding apprentices through the R basics, such as how to declare variables. Note the use of <- to assign variables. You can use = instead (at the top level the two accomplish the same thing), but by convention <- is used. Also note that several R statements can be written on one line: just use ; to separate them.
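
A short sketch of the assignment syntax described above. The variable names are illustrative only and are not taken from the session notebook.

```r
# Conventional assignment with the arrow operator
x <- 5

# The equals sign also works for top-level assignment
y = 10

# Several statements on one line, separated by semicolons
a <- 1; b <- 2; total <- a + b

print(x + y)   # 15
print(total)   # 3
```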


You will spend a lot of time talking through the different types of structures (list, vector, matrix, dataframe). A list in R shares similarities with both lists and dictionaries in Python, but the key takeaway is the difference between a list and a vector, so draw attention to this. From there, matrices and dataframes look similar (again note the difference: dataframes are tables whose columns can hold different types, while matrices are two-dimensional extensions of vectors and hold a single type). Go through the various operations, such as how to reference, add and remove rows and columns. We will cover more advanced manipulation techniques in the following section.
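
The four structures can be compared side by side in a sketch like the one below. All names and values are invented for illustration; the session notebook may use different examples.

```r
# Vector: every element has the same type
v <- c(1, 2, 3)
mixed <- c(1, "two", TRUE)   # mixing types coerces everything to character

# List: elements can differ in type and length, and can be named
# (a bit like a Python list and dictionary combined)
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

# Matrix: a vector with two dimensions, filled column by column
m <- matrix(1:6, nrow = 2)

# Dataframe: a table whose columns can have different types
df <- data.frame(id = 1:3, grade = c("A", "B", "C"))
df$score <- c(70, 80, 90)      # add a column
df <- df[, c("id", "score")]   # remove a column by keeping the others

print(class(mixed))     # "character"
print(person$scores)    # 90 85
print(m[2, 3])          # row 2, column 3
print(df[2, "score"])   # row 2 of the score column
```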


Don't give long for the exercise, 5 minutes max and breakout rooms shouldn't be necessary. This is a good opportunity for apprentices to share their screen and demo their code.


Topic Control Flow in R Duration 30 Minutes
Objectives
  • Understand how to set up loops and functions in R
Notes

Depending on the skill level of the group, this can be skipped.


In this section you will be demonstrating control flow, which should be familiar from Python (DS1). You are only introducing the syntax for R (as the concepts have been covered previously), so don't spend too much time on this section. Show apprentices the logic syntax (effectively the same as Python) and how to set up a statement. Note the difference from Python: there they used a colon (:) to start a block, but in R they use curly brackets ({}). Note that R is not as fussy as Python when it comes to indenting code, but we should still indent as it is best practice and makes our code more readable.


As you move through the rest of this section (for, while and functions) you should note that the format is similar each time and that everything they can do in Python can also be done in R.
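
The control flow syntax covered in this section can be sketched as follows. The function name `classify` and the score thresholds are hypothetical and chosen only to show the shape of the code.

```r
# A function wrapping an if / else if / else chain; blocks use curly brackets
classify <- function(score) {
  if (score >= 70) {
    "distinction"
  } else if (score >= 50) {
    "pass"
  } else {
    "fail"
  }
}

# for loop: iterate over the elements of a vector
scores <- c(45, 62, 88)
results <- character(0)
for (s in scores) {
  results <- c(results, classify(s))
}
print(results)

# while loop: repeat until the condition is no longer TRUE
countdown <- 3
while (countdown > 0) {
  countdown <- countdown - 1
}
print(countdown)
```

Indentation here is for readability only; R relies on the curly brackets, not the whitespace, to delimit each block.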


For the exercise I would put apprentices in breakout rooms and give them 10 minutes, then run through the solutions. Example solutions are embedded in the notebook, but encourage apprentices to give it a go before looking.


Topic Exploratory Data Analysis in R Duration 35 Minutes
Objectives
  • Consider how to perform EDA in R
Notes

In this section you will be introducing the concepts of EDA in R. As we have already covered the principles of EDA in Excel, SQL, Tableau and Python, there is no need to go into too much detail here. However, it may be a good idea to ask apprentices what principles they can remember (cleaning data, gathering aggregates to summarise, identifying which features will be useful for prediction, etc.).


As many of the functions we will need are part of the tidyverse, you are going to have to show apprentices how to install and load the library. Note the similarity with Python and again how easy it is to do (and forget to do...). The tidyverse does quite a lot, so encourage apprentices to explore different functions and see how they can help them with their analysis.


The first thing you will do is show apprentices the common functions, explaining that each will be demonstrated and discussed shortly. Start with filter and arrange. You can demonstrate how to check for missing values (NA) using is.na(). Note that to apply more than one function you need to wrap each new one around the previous one, and that can get confusing. To counter this, introduce and demonstrate piping as a way to clean our code up, and encourage apprentices to try using it.
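
The contrast between nested calls and piping can be sketched as below. This assumes the dplyr package (part of the tidyverse) is installed; the `bottles` dataframe is a made-up stand-in, not the dataset used in the notebook.

```r
# install.packages("tidyverse")  # run once if the packages are not installed
library(dplyr)

# Hypothetical data with one missing proof value
bottles <- data.frame(
  name  = c("A", "B", "C", "D"),
  proof = c(35, 45, 50, NA),
  price = c(10, 20, 30, 25)
)

# Nested style: read from the inside out
nested <- arrange(filter(bottles, !is.na(proof)), desc(price))

# Piped style: read from top to bottom, one step per line
piped <- bottles %>%
  filter(!is.na(proof)) %>%
  arrange(desc(price))

print(piped)
```

Both versions drop the row with the missing proof and sort by price from highest to lowest; the piped form simply lists the steps in the order they happen.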


Moving on from there, demonstrate how to manipulate and rearrange dataframes as well as how to create new columns and aggregates. Link back to how they did this in Python; it's just the syntax that is different. Encourage apprentices to edit the sample code (give time for this) and allow them to ask questions. If there is time, you can set challenges, for example: what is the average price for bottles with proof greater than 40?
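
A possible shape for the new-column, aggregate and challenge steps is sketched below. It assumes dplyr is installed, and the `bottles` data, the `category` column and the `price_per_proof` column are all invented for illustration.

```r
library(dplyr)

# Hypothetical data standing in for the notebook's dataset
bottles <- data.frame(
  category = c("gin", "gin", "rum", "rum"),
  proof    = c(35, 45, 50, 42),
  price    = c(10, 20, 30, 25)
)

# Create a new column from existing ones
bottles <- bottles %>%
  mutate(price_per_proof = price / proof)

# Aggregate: mean price per category
avg_by_category <- bottles %>%
  group_by(category) %>%
  summarise(mean_price = mean(price))
print(avg_by_category)

# Challenge: average price for bottles with proof greater than 40
strong_avg <- bottles %>%
  filter(proof > 40) %>%
  summarise(mean_price = mean(price)) %>%
  pull(mean_price)
print(strong_avg)
```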


For the exercise I would give 10 or so minutes in breakout rooms before going through the solutions. Again, sample solutions are built into the notebook.


Topic Visualisations in R Duration 35 Minutes
Objectives
  • Build Visualisations in R
Notes

In this section you will be introducing how to build visualisations using ggplot2 (part of the tidyverse). There are other packages available, but we will stick to this one for now. The syntax for building a visualisation is different from Python, so it may take apprentices a bit longer to pick this topic up. Encourage apprentices to code along with you and give them time to build the visualisations themselves and ask questions.


As you move through the section you will demonstrate how to build a scatter graph, bar chart and boxplot. The initial set-up is the same each time, ggplot(data = ..., aes(...)) (draw attention to this), but show how the layer added after it changes depending on the chart you want. Show apprentices how to customise their visualisations with colours and sizes (e.g. size=continent). You can demonstrate a line graph by using geom_line().
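
The shared set-up and the changing geom layer can be sketched as below. This assumes ggplot2 is installed and uses R's built-in iris dataset purely as a convenient example; the notebook's own dataset and column names will differ.

```r
library(ggplot2)

# Scatter graph: same ggplot() set-up, then a point layer
scatter <- ggplot(data = iris,
                  mapping = aes(x = Sepal.Length, y = Petal.Length,
                                colour = Species)) +
  geom_point()

# Bar chart: count of rows per species
bar <- ggplot(data = iris, mapping = aes(x = Species)) +
  geom_bar()

# Boxplot: distribution of sepal length for each species
box <- ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# Printing a plot object draws it
print(scatter)
```

Only the aesthetics inside aes() and the geom added with + change between the three charts, which is the pattern worth pointing out to apprentices.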


For this exercise I would give at least 15 minutes in breakout rooms. Apprentices may find this section more challenging, so be prepared to walk through the code again. As a stretch challenge there is a further exercise where apprentices have to recreate a visualisation. Again, solutions are built into the notebook.


Topic Linear Modelling in R Duration 15 Minutes
Objectives
  • Learn about different modelling techniques in R
Notes

In this section you will demonstrate how to build a linear model in R. This is a more advanced topic and can be skipped if you are running out of time. Explain to apprentices that R is built for this sort of programming and is arguably better suited to it than Python. Python is a better all-round language with a much wider array of functionality; R is narrower in scope but more powerful within it (especially as much ML functionality is built in).


Start by showing linear regression and how it is similar to the statsmodels package from Python (funny that...). Show how to build and analyse a model. You can explore coefficients by typing summary(model)$coefficients. Explain that the principles of linear modelling are exactly the same as in Python (R-Squared, etc.).
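
A minimal sketch of building and inspecting a linear model, using only base R. The built-in iris dataset and the choice of columns are assumptions made for illustration, not the notebook's example.

```r
# Fit a linear regression: petal length explained by sepal length
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Coefficient table: estimate, standard error, t value and p-value per term
coefs <- summary(model)$coefficients
print(coefs)

# R-squared, interpreted the same way as in Python
r_squared <- summary(model)$r.squared
print(r_squared)
```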


For logistic regression, show that in R this is fitted as a logit model through generalised linear models (glm). For the example in the notebook, show how to build a binary feature using ifelse and then how to build the model.
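
The two steps (a binary feature from ifelse, then a logistic model) might look like the sketch below. It uses base R and the built-in iris dataset; the column name `is_virginica` and the choice of predictor are hypothetical and not taken from the notebook.

```r
# Copy the data and build a binary target with ifelse
iris_binary <- iris
iris_binary$is_virginica <- ifelse(iris_binary$Species == "virginica", 1, 0)

# Logistic regression is a generalised linear model with the binomial family,
# whose default link function is the logit
model <- glm(is_virginica ~ Sepal.Length, data = iris_binary, family = binomial)
print(summary(model)$coefficients)

# Predicted probabilities of being virginica for each row
probs <- predict(model, type = "response")
print(head(probs))
```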


Finally (if you have time) show how to perform a t-test. This was explored in DS2, so apprentices may need a refresher on how hypothesis testing works. Explain that normally you would check the assumptions (such as normality) and perform other EDA first, but for today you are just demonstrating the code. Walk through how to set up the t-test and how to interpret it. Explain that R has a wide array of tests built in (chi-square tests, F-tests, ANOVA, etc.) and is much better suited to this than Python.
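
A two-sample t-test can be set up in base R as sketched below. The built-in iris dataset and the two groups compared are assumptions for illustration, and the usual assumption checks are skipped here just as they are in the session.

```r
# Two groups to compare: sepal length for two species
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# t.test() runs Welch's two-sample t-test by default
result <- t.test(setosa, versicolor)
print(result)

# Individual parts of the result can be pulled out for interpretation
print(result$statistic)   # t statistic
print(result$p.value)     # compare against the chosen significance level
```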


Topic RStudio Duration 5 Minutes
Objectives
  • Be aware that RStudio exists
Notes

We used Jupyter because apprentices are already familiar with it. In most industry settings they would use RStudio. If you have time, download and open RStudio to show how it works and that, although the GUI is different, the syntax is the same. Otherwise, show the GUI screenshot and point out its differences from and similarities to Jupyter.


Topic Recap Duration 5 Minutes
Objectives
  • Recap the day
Notes

Go through the learning objectives and take any questions. Show apprentices the R assignment and encourage them to look at a previous project and try to recreate it in R. Remind them to fill out the session attendance log and update their OTJT.
