Part 1: A Gentle Introduction to MapReduce

Last Update: April 2018

Abstract

This series introduces the MapReduce (MR) paradigm and guides the reader through many steps, ranging from the simplest case: one core, one thread, one machine, and one unit of data; to the most complex case: multiple cores, multiple threads, multiple machines, and M >> [1] N units of data.

This article is an introductory text with a pragmatic approach to MR, in which the basics and the motivations of this distributed programming paradigm are introduced. In addition, we start implementing our first MR program using the Python language and its built-in map and reduce functions.

1. Introduction


1.1. What is MapReduce?

Generally one can find definitions like: "[...] MR works by breaking the processing into two phases: the map phase and the reduce phase [...] Each phase has a key-value pairs as input and output [...] The programmer specifies two functions: the map function and the reduce function [...]" [Whi15]; or like: "MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster [...]" [Wik17b].

Well, having read that, you are probably not much further in your MR understanding than you were before! And that is normal, since such definitions do not target newcomers.

First of all, MR is a programming model (e.g. like object or aspect orientation) and an associated implementation. This last point is the essential difference between MR and classical programming paradigms. Indeed, without the associated implementation, MR is no more efficient than plain divide-and-conquer. You should also keep in mind that there are as many MR variants as there are implementations. But all of them share a core semantic that we'll expose here.

In order to properly introduce MR, let us first take a glance at its history...

1.2. An illustrious Ancestor

Contrary to what many people think and say, MR was not first invented by Google Inc [2]. Google invented an efficient distributed computing system based on MR to process very large amounts of data [Inc04]. Remember the aforementioned associated implementation, a.k.a. API or framework.

Most experts agree on locating the invention of the MR concept in the early versions of the Lisp programming language, in 1958 [Wik17a]. So yes, in my opinion the MR ancestor is an early feature of the Lisp programming language.

1.3. Nowadays

Modern MR comes from a more general problem-solving paradigm named the split-apply-combine strategy [Wic17], or simply Google MapReduce [Inc04]. Both are an application of one of the oldest strategies in the world: break up a big problem into manageable pieces, operate on each piece independently, and then put all the pieces back together.

Thus, whenever one meets a problem that can be decomposed into independent subproblems, and the quantity of data on which the problem runs seems larger than the memory of a single PC, then one should (automatically) think about MR.

Question: How to know if my problem is MR-Ready or not?

The answer holds in four letters: data. Yes, all (computer) problems receive data as input to produce (or not) another set of data. Whenever the input data can be broken into several independent subsets, you are heading in the right direction, because it means the data (at least on a per-group basis) are independent from one another with regard to the task or tasks that compose the final solution to the problem to be solved.

Then, the nature of the processing to apply to the data should be considered. If the goal can be reached by applying small processes/tasks to individual elements or groups of elements, generating intermediate results that can be aggregated into a final result, then you definitely have an MR-ready problem.
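The split-apply-combine strategy behind these two criteria can be sketched in a few lines of plain Python (a toy, single-machine illustration of ours, not a distributed implementation):

```python
# A toy illustration of split-apply-combine on a list of numbers;
# in a real MR system the "apply" step would run on many machines.

def split(data, n):
    """Break the data into independent chunks of (roughly) len(data) // n items."""
    k = max(1, len(data) // n)
    return [data[i:i + k] for i in range(0, len(data), k)]

def apply_step(chunks, func):
    """Operate on each chunk independently (the future 'map' side)."""
    return [func(chunk) for chunk in chunks]

def combine(partials):
    """Put the pieces back together (the future 'reduce' side)."""
    return sum(partials)

data = list(range(1, 11))                   # the numbers 1..10
partials = apply_step(split(data, 3), sum)  # one partial sum per chunk
total = combine(partials)                   # same result as sum(data)
```

Since each chunk is processed independently, nothing would change if the chunks lived on different machines; that independence is exactly what makes a problem MR-ready.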

1.4. The Future

Some of you may have been told that MR is now obsolete because of new, much-hyped technologies such as Apache Spark and/or Apache Flink. This is simply not true! The proof is that big players like Google still use MR internally, in parallel with these new programming paradigms.

The reason is simple: volume. MR is the best technology to handle (very, very) large volumes in batch processing. Consequently, if your company processes huge volumes, then MR is certainly the best (simplest) solution available to handle them safely.

In conclusion, investing time to learn MR is worthwhile. It is an investment in the future, since newcomers like Apache Spark still provide features that use the MR paradigm (e.g. reduceByxxx).

1.5. Outline

The outline of this paper is as follows. In section 2, the word count problem is implemented and explained. In section 3 a conclusion about this first in-depth contact with MR is discussed.

Take Away

After reading this article, the reader will know how to:
  • think in map and reduce, just as other developers think in objects, methods, or aspects,
  • model a problem in terms of map and reduce functions,
  • implement a simple MR problem in Python, both from scratch and by using built-in functions.

1.6. The Source Code

The source code for this article is available at [mapreduce-python].

2. Wordcount Or The MapReduce Hello World


2.1. The Use Case

The Hello World exercise for distributed computing in general consists of counting the number of words in a huge text that cannot fit into a single PC's memory. However, one can start with a text that does fit in a single PC's memory and pretend it does not. Then one writes a program that applies MR techniques to count the number of words in this text.

This is exactly what we are going to do using MR. But don't be afraid, we'll go progressively from collocated map and reduce functions to distributed ones.

Before going further, the following concepts have to be clarified:

  • the quality of the text, and
  • the definition of "word" in our context

In order to simplify the context and avoid implementing a complex sanitization function (which is out of the scope of this paper), we assume the input text contains only English words, punctuation, digits and/or decimal numbers. The word concept in our context therefore encompasses, in addition to classical English words, numbers and digits. Punctuation marks are not considered words.
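This word definition can be captured by a small predicate (a helper of our own, for illustration only; it is not reused by the code later in this article, and it assumes the token has already been stripped of surrounding punctuation):

```python
def is_word(token):
    # Plain alphanumeric tokens ("Hello", "42") are words, and so are
    # decimal numbers ("3.14"); punctuation-only tokens are not.
    return token.isalnum() or token.replace('.', '', 1).isdigit()
```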

For example, the longest sequence of words we can extract from the following text:

So, what's going on here? Well, there's three steps to this process.

is

['So', 'what', 'is', 'going', 'on', 'here', 'Well', 'there', 'is', 'three', 'steps', 'to', 'this', 'process']
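As a quick sanity check, the same word list can be obtained with a throwaway regular expression (this is not the approach used in the rest of the article, which builds its own sanitization step; it only handles the 's contraction appearing in this sentence):

```python
import re

sentence = "So, what's going on here? Well, there's three steps to this process."

# Expand the 's contraction first, then keep only alphanumeric runs.
words = re.findall(r"[A-Za-z0-9]+", sentence.replace("'s", " is"))
```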

2.2. The Text

The text used to test our very first version of MR is as follows:

So, what's going on here? Well, there's three steps to this process. First of all, the Map function receives independent lists of tokens from the source text. In parallel, it will attempt to strip off any leading or trailing punctuation, and if it is a well formed word, adds it to the results list with a count of 1. All such tokens are processed in this way, and duplicates are also appended to this list.

After the Map operation is complete, the program needs to organize all the processed data. This is where the partition function comes in. This function receives a huge list of lists; each constituent list was processed by an instance of Map and thus contains a sequence of (token, 1) pairs. This function organizes all the data into a dictionary, so that now each token key has a list of (token, 1) tuples as a value.

The final step is in the Reduce function. It receives arbitrary key:value pairs from the dictionary built-in partition, which it uses to generate the final words count. In parallel, the list in each dictionary entry is collapsed, summed, and the final count for each token key is returned by each instance of Reduce.

The final result is a list of (token, count) tuples, where count refers to the total count of the token over the entire document.

Let's get to the Python multiprocessing code, and the other stuff we need to make the above functions work. I'll create a worker pool of eight processes, and distribute the above workload to those processes via the Pool.map function.

As you can see, this text speaks about MR implementation and could itself have been a fragment of the current article. However, the clever reader will already have noticed that this simple text contains some traps that must be handled with care in a real-world Natural Language Processing (NLP) problem. For instance, the antepenultimate word, Pool.map, with its dot sign, could be interpreted as an end of sentence, and consequently as two words; which is not exactly the truth.

2.3. The Implementation

The first implementation we'll study is highly inspired by [Cve10] and [Nie09]. The Python language is used again for its rapid prototyping capabilities. In addition, it offers a basic MR framework [python-MR] that we rely on. However, it is easy to craft one's own MapReduce "framework" by providing a bespoke implementation of both the map() and reduce() functions, as explained in [Nie09].

Let's start with the tool functions that are used by the mappers and reducers.

2.3.1. The Tokenization

The load() function reads a text file and splits it into tokens using white space as the delimiter. However, many tokens are then a concatenation of a real word and a separator or punctuation mark (e.g. "world,"). In addition, many composite words deserve to be split into several ones: "what's" should be split into "what" and "is". This last task is handled by the sanitization code presented below in the document.

import itertools

def load(path):
    # Read the whole file and split it into tokens on white space.
    with open(path, "r") as f:
        text = f.read()
    return text.split()

For the 1st test, we load a single sentence:

# test 1: The sentence.txt file contains the example sentence of section 2.1.
l = load("./sentence.txt")
print(l)
# result:
['So,', "what's", 'going', 'on', 'here?', 'Well,', "there's", 'three', 'steps', 'to', 'this', 'process.']
# test 2: The text.txt file contains the whole text of section 2.2.
ll = load("./text.txt")
print(ll)
# result:
['So,', "what's", 'going', 'on', 'here?', 'Well,', "there's", 'three', 'steps', 'to', 'this', 'process.', 'First', 'of', 'all,', 'the', 'Map', 'function', 'receives', 'independent', 'lists', 'of', 'tokens', 'from', 'the', 'source', 'text.', 'In', 'parallel,', 'it', 'will', 'attempt', 'to', 'strip', 'off', 'any', 'leading', 'or', 'trailing', 'punctuation,', 'and', 'if', 'it', 'is', 'a', 'well', 'formed', 'word,', 'adds', 'it', 'to', 'the', 'results', 'list', 'with', 'a', 'count', 'of', '1.', 'All', 'such', 'tokens', 'are', 'processed', 'in', 'this', 'way,', 'and', 'duplicates', 'are', 'also', 'appended', 'to', 'this', 'list.', 'After', 'the', 'Map', 'operation', 'is', 'complete,', 'the', 'program', 'needs', 'to', 'organize', 'all', 'the', 'processed', 'data.', 'This', 'is', 'where', 'the', 'partition', 'function', 'comes', 'in.', 'This', 'function', 'receives', 'a', 'huge', 'list', 'of', 'lists;', 'each', 'constituent', 'list', 'was', 'processed', 'by', 'an', 'instance', 'of', 'Map', 'and', 'thus', 'contains', 'a', 'sequence', 'of', '(token,', '1)', 'pairs.', 'This', 'function', 'organizes', 'all', 'the', 'data', 'into', 'a', 'dictionary,', 'so', 'that', 'now', 'each', 'token', 'key', 'has', 'a', 'list', 'of', '(token,', '1)', 'tuples', 'as', 'a', 'value.', 'The', 'final', 'step', 'is', 'in', 'the', 'Reduce', 'function.', 'It', 'receives', 'arbitrary', 'key:value', 'pairs', 'from', 'the', 'dictionary', 'built-in', 'partition,', 'which', 'it', 'uses', 'to', 'generate', 'the', 'final', 'words', 'count.', 'In', 'parallel,', 'the', 'list', 'in', 'each', 'dictionary', 'entry', 'is', 'collapsed,', 'summed,', 'and', 'the', 'final', 'count', 'for', 'each', 'token', 'key', 'is', 'returned', 'by', 'each', 'instance', 'of', 'Reduce.', 'The', 'final', 'result', 'is', 'a', 'list', 'of', '(token,', 'count)', 'tuples,', 'where', 'count', 'refers', 'to', 'the', 'total', 'count', 'of', 'the', 'token', 'over', 'the', 'entire', 'document.', "Let's", 'get', 'to', 'the', 'Python', 
'multiprocessing', 'code,', 'and', 'the', 'other', 'stuff', 'we', 'need', 'to', 'make', 'the', 'above', 'functions', 'work.', "I'll", 'create', 'a', 'worker', 'pool', 'of', 'eight', 'processes,', 'and', 'distribute', 'the', 'above', 'workload', 'to', 'those', 'processes', 'via', 'the', 'Pool.map', 'function.']

2.3.2. The Sanitization

Before diving into the algorithmic part, let us consider the "sanitization" operations required to clean up the input document. Functionally speaking, the sanitization consists of:

  • removing some punctuation marks,
  • transforming all the ’s suffixes into the word is,
  • transforming all the ’ll suffixes into the word will,
  • transforming all the ’d suffixes into the word would.

Later, the reader can extend the sanitization scope and add more features, like the handling of ellipses, hyphens, and so forth. The sanitization is implemented as being either local (e.g. at word level) or global (e.g. at document level). The local part is implemented by the sanitize_word() function, and the global one by the sanitize() function.

def sanitize_word(w):
    if not w.isalnum():
        # ex. what's, ...
        i = w.find("'")
        if i != -1:
            p1 = w[0:i]
            p2 = w[i+1:]  # fallback: keep the suffix as-is
            if w[i+1:] == 's':
                p2 = 'is'
            elif w[i+1:] == 'll':
                p2 = 'will'
            elif w[i+1:] == 'd':
                p2 = 'would'
            # put other transformation rules here ...
            return [p1, p2]
        else:
            if len(w) > 0:
                # Strip punctuation from the front or the back
                if not w[0].isalnum():
                    return [w[1:]]
                if not w[-1].isalnum():
                    return [w[:-1]]
            else:
                return []
    # the word isalnum(), or has nothing to strip at its ends (e.g. Pool.map)
    return [w]

def sanitize(W):
    # to return a 2D list
    return [sanitize_word(w) for w in W]


# tribute: [Stack14681609]
def to_2D(l):
    return [l[i:i+1] for i in range(0, len(l), 1)]            

def to_1D(l):
    return list(itertools.chain.from_iterable(l))

# test:
print(sanitize_word("here's"))
print(sanitize_word("I'll"))
# result:
['here', 'is']
['I', 'will']
# test:
print(sanitize(ll))
# result: 
[['So'], ['what', 'is'], ['going'], ['on'], ['here'], ['Well'], ['there', 'is'], ['three'], ['steps'], ['to'], ['this'], ['process'], ['First'], ['of'], ['all'], ['the'], ['Map'], ['function'], ['receives'], ['independent'], ['lists'], ['of'], ['tokens'], ['from'], ['the'], ['source'], ['text'], ['In'], ['parallel'], ['it'], ['will'], ['attempt'], ['to'], ['strip'], ['off'], ['any'], ['leading'], ['or'], ['trailing'], ['punctuation'], ['and'], ['if'], ['it'], ['is'], ['a'], ['well'], ['formed'], ['word'], ['adds'], ['it'], ['to'], ['the'], ['results'], ['list'], ['with'], ['a'], ['count'], ['of'], ['1'], ['All'], ['such'], ['tokens'], ['are'], ['processed'], ['in'], ['this'], ['way'], ['and'], ['duplicates'], ['are'], ['also'], ['appended'], ['to'], ['this'], ['list'], ['After'], ['the'], ['Map'], ['operation'], ['is'], ['complete'], ['the'], ['program'], ['needs'], ['to'], ['organize'], ['all'], ['the'], ['processed'], ['data'], ['This'], ['is'], ['where'], ['the'], ['partition'], ['function'], ['comes'], ['in'], ['This'], ['function'], ['receives'], ['a'], ['huge'], ['list'], ['of'], ['lists'], ['each'], ['constituent'], ['list'], ['was'], ['processed'], ['by'], ['an'], ['instance'], ['of'], ['Map'], ['and'], ['thus'], ['contains'], ['a'], ['sequence'], ['of'], ['token,'], ['1'], ['pairs'], ['This'], ['function'], ['organizes'], ['all'], ['the'], ['data'], ['into'], ['a'], ['dictionary'], ['so'], ['that'], ['now'], ['each'], ['token'], ['key'], ['has'], ['a'], ['list'], ['of'], ['token,'], ['1'], ['tuples'], ['as'], ['a'], ['value'], ['The'], ['final'], ['step'], ['is'], ['in'], ['the'], ['Reduce'], ['function'], ['It'], ['receives'], ['arbitrary'], ['key:value'], ['pairs'], ['from'], ['the'], ['dictionary'], ['built-in'], ['partition'], ['which'], ['it'], ['uses'], ['to'], ['generate'], ['the'], ['final'], ['words'], ['count'], ['In'], ['parallel'], ['the'], ['list'], ['in'], ['each'], ['dictionary'], ['entry'], ['is'], ['collapsed'], ['summed'], ['and'], 
['the'], ['final'], ['count'], ['for'], ['each'], ['token'], ['key'], ['is'], ['returned'], ['by'], ['each'], ['instance'], ['of'], ['Reduce'], ['The'], ['final'], ['result'], ['is'], ['a'], ['list'], ['of'], ['token,'], ['count'], ['tuples'], ['where'], ['count'], ['refers'], ['to'], ['the'], ['total'], ['count'], ['of'], ['the'], ['token'], ['over'], ['the'], ['entire'], ['document'], ['Let', 'is'], ['get'], ['to'], ['the'], ['Python'], ['multiprocessing'], ['code'], ['and'], ['the'], ['other'], ['stuff'], ['we'], ['need'], ['to'], ['make'], ['the'], ['above'], ['functions'], ['work'], ['I', 'will'], ['create'], ['a'], ['worker'], ['pool'], ['of'], ['eight'], ['processes'], ['and'], ['distribute'], ['the'], ['above'], ['workload'], ['to'], ['those'], ['processes'], ['via'], ['the'], ['Pool.map'], ['function']]

As you may have guessed, the sanitize() function is used on the output of the load() function. In addition, as illustrated by the test, the result is a 2D list (e.g. a list of lists) where each word is a singleton list. Notice: the to_2D() and to_1D() functions will be used later in the document, but we provide unit tests for them right now.

# test:
to_2D([1, 2, 3, 4, 5])
# result:
[[1], [2], [3], [4], [5]]
# test:
to_2D([[1], [2], [3], [4], [5]])
# result:
[[[1]], [[2]], [[3]], [[4]], [[5]]]
# test:
to_1D([[1], [2], [3], [4], [5]])
# result:
[1, 2, 3, 4, 5]

2.3.3. Thinking in MapReduce

In order for legacy developers not to be lost, and for new ones to learn something, let us make the connection between this new programming model and the concept of a Business Rule (BR) (e.g. the legacy development unit brick). Programming in MapReduce consists of implementing the BR with mainly two functions, f1 and f2, that will be executed respectively by the map and the reduce phase. Later we'll replace the word phase by function; but for the moment, let's speak about phases. In this article, the input dataset is reduced to a single textual document. This document is converted into a single (2D) list of words (e.g. sanitized tokens). Thus, the business here consists of evaluating the size of this list. The built-in function len() does the job correctly:

W = ['So', 'what', 'is', 'going', 'on', 'here', 'Well', 'there', 'is', 'three', 'steps', 'to', 'this', 'process']
len(W) # produces 14

2.3.3.1. The Business Rule

But the objective here is to implement this business (rule) using the map and reduce phases, with two more functions that really implement the BR. The definition of the map phase/function in [python-MR] starts with "Apply function to every item of iterable and return a list of the results..." and ends with "... the result is always a list." The definition of the reduce() function starts with "Apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value...". Our road is now enlightened, and we have a clearer idea of what the functions f1() and f2() will look like.
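The cumulative, left-to-right behaviour of reduce() described above can be checked with a tiny experiment (our own):

```python
from functools import reduce  # reduce() lives in functools in Python 3

f2 = lambda x, y: x + y

# reduce() applies f2 cumulatively from left to right:
# reduce(f2, [1, 2, 3, 4]) == f2(f2(f2(1, 2), 3), 4) == ((1 + 2) + 3) + 4
total = reduce(f2, [1, 2, 3, 4])
```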

Question: What is the f1() function that will be executed by the map() one?

  • f1 input: We have a list of lists of words (remember the output of the sanitize() function above). The item (in the aforementioned map() definition) is therefore a (sub)list, such as ['Well'] or ['there']. So the input of the f1() function is a list.
  • f1 output: Now let us find the output of f1(): the Y such that f1(['Well']) = Y, or f1([['there'], ['is']]) = Y. Somewhere in our mind, let us keep holding the target, or business goal: counting the number of words in a list. A list can be broken into sublists, and the number of words in the list is the sum of the numbers of words of all its sublists. So Y here can (reasonably) be the number of words of the (sub)list X. Then f1(['Well']) = 1 and f1([['there'], ['is']]) = 2.
Question: What is the f2() function that will be executed by the reduce() one?
  • f2 input: As suggested by its definition, the reduce() phase/function accepts an iterable as its second parameter: a list of items. In addition, f2() must be a binary function (e.g. with two operands). Since map() is applied "before" reduce(), the input of reduce() is the output of map(), which is the application of f1() to the whole list (of sublists). The output of map() is therefore something like [f1(['word1']), f1([['word2'], ['word3']]), ..., f1(['wordN'])] = [1, 2, ..., 1].
  • f2 output: Finally, since f2() should produce the total number of words in the list, f2() (combined with reduce()) should apply a cumulative sum to all items of the map() output.
Implementation: let's now formalize both the f1 and f2 functions in terms of lambda expressions:

f1 = lambda x: len(x)
f2 = lambda x,y: x+y

Note: the reader should notice that the implementation as lambda functions is not mandatory at all. It is simply a matter of elegance, and that's all.
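Indeed, since f1 merely wraps len() and f2 merely wraps addition, one could equivalently pass ready-made functions instead of lambdas (shown here only for comparison):

```python
from functools import reduce
import operator

lists = [['a'], ['b', 'c'], ['d']]

# Lambda version, as used in this article.
n1 = reduce(lambda x, y: x + y, map(lambda x: len(x), lists))

# The same computation with the built-in len and operator.add.
n2 = reduce(operator.add, map(len, lists))
```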

2.3.3.2. An Example of MapReduce Driver Program

Now that we have implemented our BR functions, and before writing a driver program for the wordcount problem, let us write a (simplistic) MapReduce driver program that orchestrates the map and reduce phases/functions to count the number of elements distributed over several lists of integers.

from functools import reduce

f1 = lambda x: len(x)
f2 = lambda x,y: x+y

def main():
    a = [1, 2, 3]
    b = [4, 5, 6, 7]
    c = [8, 9, 1, 2, 3]
    # what is the total number of items in the a+b+c?
    L = list(map(f1, [a, b, c]))  # in Python 3, map() returns an iterator, hence the list()
    print("[INFO] After the map phase: {}".format(L))
    N = reduce(f2, L)
    print("[INFO] After the reduce phase: {}".format(N))

main()
# result:
[INFO] After the map phase: [3, 4, 5]
[INFO] After the reduce phase: 12

Exercise: Adapt the previous driver and map function to produce a program that outputs the cumulative sum of all the numbers in the several lists of integers.
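One possible solution to this exercise (our own sketch; other formulations work just as well) makes the map phase compute the partial sum of each sublist instead of its length:

```python
from functools import reduce

f1 = lambda x: sum(x)    # map phase: partial sum of one sublist
f2 = lambda x, y: x + y  # reduce phase: combine the partial sums

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]

partial_sums = list(map(f1, [a, b, c]))  # [6, 22, 23]
grand_total = reduce(f2, partial_sums)   # cumulative sum of all the numbers
```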

After the execution of the map built-in function, we obtain the number of items of each sublist. This function simply applies the f1 function to all elements of the list. Then the reduce built-in function cumulatively applies the f2 function to all pairs of elements of the input list: (1) to (3,4): f2(3,4) = 7, and (2) to (7,5): f2(7,5) = 12. You now have an idea of the ready-to-use built-in mapreduce framework provided off-the-shelf by the Python language. In the file simple_MR.py of the source code available [here], the reader will find other examples that illustrate the application of the same algorithm to several other forms of lists (e.g. lists with sublists of words and integers). The following example illustrates the application of our algorithm to a split sentence:

def main():
    W = to_2D(['So,', 'what', 'is', 'going', 'on', 'here', 'Well,', 'there', 'is', 'three', 'steps', 
                   'to', 'this', 'process'])
    print("[INFO] W = {}".format(W))
    L = map(f1, W)
    n = reduce(f2, L)
    print("[INFO] We use MR to count the number of elements of W = {}".format(n))

main()
# result:
[INFO] W = [['So,'], ['what'], ['is'], ['going'], ['on'], ['here'], ['Well,'], ['there'], ['is'], ['three'], ['steps'], ['to'], ['this'], ['process']]
[INFO] We use MR to count the number of elements of W = 14

Notice here the usage of the tool function to_2D(), which is necessary to convert the initial list of tokens into something that is meaningful (e.g. functionally and not technically) to the BR f1 function, which returns the length of its argument. So, if we want to count the number of tokens, this function should receive each token as a singleton list. If not, this is what happens:

def main():
    W = ['So,', 'what', 'is', 'going', 'on', 'here', 'Well,', 'there', 'is', 'three', 'steps', 'to', 
        'this', 'process']
    print("[INFO] W = {}".format(W))
    L = map(f1, W)
    N = reduce(f2, L)
    print("[INFO] We use MR to count the number of elements of W = {}".format(N))

main()
# result:
[INFO] W = ['So,', 'what', 'is', 'going', 'on', 'here', 'Well,', 'there', 'is', 'three', 'steps', 'to', 'this', 'process']
[INFO] We use MR to count the number of elements of W = 55
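The 55 above is in fact a character count: len() applied to a string returns its number of characters, and the per-word character counts of W do sum to 55 (a quick check of our own):

```python
from functools import reduce

W = ['So,', 'what', 'is', 'going', 'on', 'here', 'Well,', 'there', 'is',
     'three', 'steps', 'to', 'this', 'process']

# len() applied to each string counts its characters ...
char_counts = list(map(len, W))
# ... so reducing with + yields the character total, not the word total.
total_chars = reduce(lambda x, y: x + y, char_counts)
```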

Here, the number of characters is returned: the built-in function len() requires an iterable, and a string is an iterable of characters, so applying len() to each word yields its number of characters. To finish this section, we provide a simple adaptation to our context, using a complete driver that loads our text file, sanitizes it, and applies the simple mapreduce algorithm to count the number of words. See the one_process_MR.py file:

from functools import reduce

def main(path):
    print("[INFO]")
    print("[INFO] one_process_MR.py")
    print("[INFO]")
    W = load(path)
    W = sanitize(W)
    # mapreduce
    N = reduce(lambda x,y:x+y, map(lambda x:len(x), W))
    print("[INFO] nb words for document [{}] = {}".format(path,N))
    
main("./text.txt")

# result:
[INFO]
[INFO] one_process_MR.py
[INFO]
[INFO] nb words for document [./text.txt] = 270

3. Conclusion


In this article, the very basics of the MapReduce programming paradigm were introduced. One of the first lessons to learn is that mapreduce is always associated with an underlying implementation framework: here, the Python built-in map and reduce functions formed this underlying framework. At first glance, one may think it is a lot of code to simulate the Python built-in len() function, which can be used to count the number of items in a list. But we'll soon see that this built-in function cannot deal with huge lists that don't fit into the memory of a single computer. In addition, this first version is functional, but not very efficient. Indeed, the program essentially runs within a single (synchronized) process: it counts the words of the 1st sentence, then the words of the 2nd sentence, and so forth. After this, we apply a cumulative sum: the 1st and 2nd results are added, giving an intermediate sum, which is added to the 3rd result, and so forth. In the next article we'll see how to optimize this naive approach by exploiting all the cores of our CPUs.

Footnotes


[1] M being very, very large compared to N; larger by several orders of magnitude.
[2] The French version of the Wikipedia page for MapReduce [Wik17c] states that Google was the inventor of MR. This is wrong, and the reader can simply read the content of the US patent granted to Google for its invention around MR [Inc04].

Bibliography


[Inc04] Google Inc. System and method for efficient large-scale data processing. us 7650331 b1. 2004. https://www.google.com/patents/US7650331.
[Nie09] Michael Nielsen. Write your first MapReduce program in 20 minutes. http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/, 2009.
[Cve10] Michael Cvet. Parallel MapReduce in Python in Ten Minutes. https://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/, 2010.
[Whi15] Tom White. Hadoop: The Definitive Guide. Fourth Edition. O’REILLY, 1005 Gravenstein Highway North, Sebastopol, CA 95472, 2015.
[Wic17] Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 2017. https://www.jstatsoft.org/article/view/v040i01.
[Wik17a] Wikipedia. Lisp. 2017. https://fr.wikipedia.org/wiki/Lisp.
[Wik17b] Wikipedia. Mapreduce. 2017. https://en.wikipedia.org/wiki/MapReduce.
[Wik17c] Wikipédia. Mapreduce. 2017. https://fr.wikipedia.org/wiki/MapReduce.