Week one of the course introduced Apache Hadoop, software used for distributed processing of massive unstructured data; week two took us deeper into accessing and operating on big data, and far, far away from my comfort zone. The least geeky explanations of Hadoop and MapReduce that I could find are here and here (in the section ‘How to make sense of big data’).
As a total beginner, I found the overviews in week two too technical to understand, but by the end of the course I had a basic understanding of putting data somewhere, moving it about, and reorganising it so it can be queried quickly. This was very much a theoretical understanding, not something I could start applying to my own data.
Before starting this or a similar introductory course, a total beginner might find these tips helpful:
Remind yourself that this course is not about getting to grips with the detail of all the different tools and software, which are constantly evolving. Unless you already have some experience, aim for a broad and general understanding of the tasks involved in operating on big data.
Do all the practical exercises anyway. I had never used the Java programming language before, and only had the most basic grasp of what was going on, but I still did all the practical exercises using the step-by-step instructions. It was only through using Hadoop and MapReduce that I really began to get a feel for what accessing and operating on big data means; just reading about it was too abstract. Whilst I was not learning about a specific task, I was learning about the nature of the data and how to work with it. And it is all useful practice in working with different programming languages and software.
Focus on the practical exercises intensively, rather than dipping in and out. It is easier when you can get a bit of a flow going.
Visualise what is happening behind the scenes. There are similarities between the tasks we perform as researchers and the tasks performed in big data analytics – clean, store, access, quality assure, future proof – but the operational nature of these tasks is very different. I was using the software without really understanding what was going on behind the scenes. I found a couple of analogies useful. Engineering, with the need to balance the ‘weight’ of data and distribute loads across frameworks. Also, traffic planning, where the data is the traffic: you are trying to get the traffic to where it needs to be without coming to a standstill because of clogged-up roads or accidents. Nor do you want any of the traffic to get lost, end up in the wrong location, or have to waste time making the journey twice.
The basic concept behind MapReduce is distributing a task across multiple nodes for parallel data processing. When you have big data that would be too slow to process in a linear way, you can split it up, have it processed in different locations concurrently, and then bring it all back together again to give you the answer.
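To make that concrete, here is a minimal sketch in plain Java (my own illustration, not taken from the course) that mimics the three phases on a couple of hard-coded lines of text: map each input split into (word, 1) pairs, shuffle the pairs together by word, then reduce each word’s values to a total. On a real cluster the map and reduce phases run in parallel on different nodes; here everything runs in one process just to show the flow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the MapReduce idea in plain Java (no Hadoop, no cluster):
// map each input split to (word, 1) pairs, shuffle the pairs together by key,
// then reduce each key's list of values to a single count. On a real cluster
// the map and reduce steps would run concurrently on different nodes.
public class MapReduceSketch {

    public static void main(String[] args) {
        // Two "splits", as if one file had been divided across two nodes.
        List<String> splits = Arrays.asList("big data is big", "data about big data");

        // Map phase: each split independently emits (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }

        // Shuffle phase: group all the emitted values by key (the word).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: sum each word's list of 1s to get its total count.
        grouped.forEach((word, ones) ->
                System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}
```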
To count words in a text file, I uploaded data to HDFS, then compiled and ran a MapReduce job and viewed the results. That task is relatively easy – counting how often each word appears in a body of text is something you could do in an Excel spreadsheet using ‘=SUMPRODUCT’. But MapReduce was doing this very fast with masses of unorganised data, not a structured Excel spreadsheet.
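For a flavour of what that exercise involves, the code looks something like the standard WordCount example from the Hadoop documentation (a sketch of the same kind of job, not the course’s exact exercise): the mapper emits a (word, 1) pair for every word it reads, the reducer adds up the 1s for each word, and the input and output paths point at directories on HDFS.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count with the Hadoop MapReduce API, closely following the standard
// WordCount example from the Hadoop documentation.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();           // add up the 1s for this word
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```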
I started to think about where this would be useful. Imagine evaluating a large service, with data to triangulate from free-text survey responses, interviews, observations and comments on social media. A word count that could quickly be executed across all of these sources might helpfully contribute to developing a coding framework for analysis and to running an initial cut of the data against certain keywords.
For me, this is still hypothetical. At this stage, I don’t know enough to know if this kind of thing would be possible or useful, let alone how to do it. I didn’t come away from this first course in the programme with any new practical skills that I could apply on a project. But the basic overview of concepts and processes was worth the investment of time (about 4 hours). I now know that this kind of thing might be possible on a project and I could have a useful conversation with an expert about taking it forward.
After this first course in the four-course programme on big data analytics I felt I would benefit from a broader overview before I continued with course two, on statistical inference and machine learning. Something less focused on specific tools and technical challenges, and written for an intelligent non-expert. Fortunately, I found it.