What is Amazon EMR?

An Amazon EMR stands for Amazon Elastic MapReduce. It is a web service used to process the large amounts of data in a cost-effective manner. The central component of an Amazon EMR is a cluster. Each cluster is a collection of EC2 instances and an instance in a cluster is known as node. Each node has a specified role attached to it known as a node type, and an Amazon EMR installs the software components on node type.

Following are the node types:

  • Master node
    A master node runs the software components to distribute the tasks among other nodes in a cluster. It tracks the status of all the tasks and monitors the health of a cluster.
  • Core node
    A core node runs the software components to process the tasks and stores the data in Hadoop Distributed File System (HDFS). Multi-node clusters will have at least one core node.
  • Task node
    A task node with software components processes the task but does not store the data in HDFS. Task nodes are optional.