After social media analytics I signed up for this Big Data Analytics programme. Also by Queensland University of Technology, it is a series of four courses covering: collecting, storing and managing data (week 1); statistical inference and machine learning (week 2); mathematical modelling (week 3); and data visualisation (week 4).
(As an aside for anyone else out there who is juggling parenting or caring responsibilities with attempts at professional development: this programme took me considerably longer than four weeks to complete, which was fine, and it is possible to do each course as a stand-alone.)
I approached this programme with more trepidation than the previous courses. I was completely unfamiliar with the concepts, language and software, and I have never worked on a team where big data analysis was part of the job. I don’t know how to program and my maths was never that advanced. This was about taking a deep breath and diving in, remembering that all I am trying to do is get a broad overview and basic understanding of what big data analytics is all about – enough to help me identify the path I might want to explore in more detail.
The first two-week course in the programme introduced data management and computation – how to collect, share and manage big data. Week 1 takes you through an introduction to big data, with links to a wide range of additional reading, some of which was very technical (I link to some of the most useful and accessible papers in the post ‘What is big data?’). It also provides some examples of big data in action, from helping to protect the Great Barrier Reef, to public health issues like cancer, to the more familiar analytics-for-business examples.
The course leaders also get you thinking about some questions, and these would be useful to share with clients or colleagues interested in working with big(ger) data.
• What does big data mean from your perspective, and why?
• What big data analyses do you currently conduct?
• What data analysis do you, or your organisation, currently conduct that could become difficult in the future (for instance, if the data you use were 100 times larger)?
• What big data analysis or questions would your organisation like to explore, but can’t now?
The emphasis on proper strategic planning for big data analysis is very similar to planning an evaluation (you can play ‘snap’ with the questions in this post).
• What is the purpose of the data (do we want to explore the data or are we looking to use it for a specific purpose or decision)?
• What data are available (quality, sources, gaps etc.)?
• How can we best capture the data (to be most time & cost effective, to make sure we get useful data)?
• How will we manage the data?
• How will we analyse the data?
• How will we interpret and visualise the data (it is important that the final users of the analysis and the data scientists work together, for a mutual understanding of the wider context and of the statistical uncertainty in the analysis)?
When we look at the detail of the ACEMS Big Data Wheel, we start to see how many of the steps involve specific, technical choices – about software, algorithms, the maths and stats for the analysis, storage, speed and so on.
[Image: the ACEMS Big Data Wheel]
Most of the first week was well pitched as an introduction for a total beginner to big data analytics. The last few tasks of the course get more technical, and I could have done with a glossary or a more basic description of the different tasks performed by the software we were introduced to. As a minimum, this description from the start of week 2 would have been useful to see at this earlier stage:
“Big data management is a pie with many slices; that is, it’s not just one big step but a combination of many steps. For each of the steps on the pie, there is an explosion of methods, software products and software platforms.”
And this article – eight pages of easy-to-read examples from each slice of the pie – is a good mental warm-up before having a go yourself.
We were introduced to Apache Hadoop, software for the distributed processing of massive amounts of unstructured data. At this point I understood the Hadoop Distributed File System (HDFS) as the framework for storing big data across many ordinary machines, without needing one great big computer, and as a way of working with unstructured data; and MapReduce as the tool for moving the data around to answer queries. We were also briefly introduced to SQL (Structured Query Language), the traditional programming language for data that is organised (or structured) in a database.
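To make the MapReduce idea a little more concrete, here is a toy word count written in plain Python – not actual Hadoop code, just a sketch of the pattern: a ‘map’ step turns each line of text into (word, 1) pairs, and a ‘reduce’ step adds up the counts for each word. A real Hadoop job applies the same two steps, usually written in Java, spread across many machines and much more data.

```python
# A toy illustration of the map/reduce pattern in plain Python.
# Real Hadoop MapReduce applies the same two steps across many machines,
# with HDFS holding the data; the example lines below are made up.
from collections import defaultdict

def map_step(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def reduce_step(pairs):
    """Add up the counts for each word across all the emitted pairs."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = [
    "big data is not just big",
    "big data needs new tools",
]

# 'Map' every line, then 'reduce' all the (word, 1) pairs into totals.
all_pairs = (pair for line in lines for pair in map_step(line))
print(reduce_step(all_pairs))
# {'big': 3, 'data': 2, 'is': 1, 'not': 1, 'just': 1, 'needs': 1, 'new': 1, 'tools': 1}
```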
But none of this made much sense through reading alone – you just have to start doing it. And so, without having a real understanding of what I was doing, by the end of week one I had loaded all the necessary software onto my laptop (no small task in itself, as I discuss here) and could move and delete files in Hadoop, ready to upload data to HDFS and to compile and run MapReduce jobs on the data in week 2.
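For anyone wondering what ‘moving and deleting files in Hadoop’ actually looks like, the sketch below drives the standard hdfs dfs command-line tool from Python. The directory and file names are made up for illustration; on a real installation you would more likely type the same commands straight into the terminal.

```python
# A minimal sketch of basic HDFS file operations, run from Python via the
# standard `hdfs dfs` command-line tool that ships with Hadoop.
# The paths below (e.g. /user/demo) are hypothetical examples.
import subprocess

def hdfs(*args):
    """Run a single `hdfs dfs` sub-command and stop if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/user/demo/input")            # create a directory in HDFS
hdfs("-put", "local_data.txt", "/user/demo/input")  # upload a local file
hdfs("-ls", "/user/demo/input")                     # list what is there
hdfs("-rm", "/user/demo/input/local_data.txt")      # delete the file again
```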