101 Hadoop Interview Questions with Answers

Hadoop is a trending technology with many subdivisions as its branches. These 101 Hadoop interview questions with answers are grouped into HDFS, MapReduce, HBase, Sqoop, Flume, ZooKeeper, Pig, Hive, and YARN questions. To help students prepare from the interview point of view, our Big Data Training professionals have listed the 101 questions below.

  1. Explain about Hadoop streaming?

Hadoop streaming is a generic API that lets you create and run MapReduce jobs with any executable or script as the mapper or the reducer, written in languages such as Python, Perl, or Ruby.
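
As an illustrative sketch (not tied to any particular cluster), here is the kind of word-count mapper and reducer Hadoop Streaming could run as two separate Python scripts, each reading stdin and writing tab-separated key-value lines to stdout:

```python
# Word-count mapper and reducer in the Hadoop Streaming style (a sketch;
# in a real job each function would be its own script reading sys.stdin).

def mapper(lines):
    # Emit one "word<TAB>1" line per word, as Hadoop Streaming expects.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # Hadoop sorts the map output by key before the reduce step runs.
    counts = {}
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    for word in sorted(counts):
        yield f"{word}\t{counts[word]}"

mapped = sorted(mapper(["hadoop streaming", "hadoop jobs"]))
print(list(reducer(mapped)))  # -> ['hadoop\t2', 'jobs\t1', 'streaming\t1']
```

On a real cluster the equivalent scripts would be submitted with the hadoop-streaming jar, passing them as the -mapper and -reducer options.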

  1. Explain the best hardware configuration to run Hadoop?

There is no single best configuration; the hardware depends on the workflow and memory requirements. Hadoop jobs are commonly run on dual-core machines or dual processors with 4GB or 8GB RAM. ECC memory is strongly recommended, because users have experienced checksum errors when using non-ECC memory.

  1. List out the common input formats in Hadoop?

Text input format, key-value input format, and sequence file input format are some of the common input formats in Hadoop.

  1. List out the steps involved in the big data solution?

Data ingestion, data storage, and data processing are the three steps involved in a big data solution. Data can be extracted from different sources such as SAP, CRM, log files, flat files, documents, images, social media feeds, and RDBMS like MySQL or Oracle, and it can be ingested through batch jobs or real-time streaming. The extracted data is then stored in HDFS or in a NoSQL database like HBase. After storage, the data is processed using frameworks such as MapReduce, Spark, Pig, and Hive.

  1. What is the difference between HDFS and HBase?

HDFS is a distributed file system designed for write-once, sequential access to large files, whereas HBase is a database built on top of HDFS that supports random, real-time read and write access to individual records.

  1. What are the various factors to choose the file format in Apache Hadoop?

Schema, usage pattern with respect to the number of columns, whether the data can be split for parallel processing, storage space, and the read, write, and transfer performance are some of the factors that influence the choice of file format in Apache Hadoop.

  1. What are the different types of file formats used to store in the Apache Hadoop?

CSV, JSON, columnar formats, sequence files, Avro, and Parquet files are some of the file formats used in Apache Hadoop. Join the Hadoop Training in Chennai to prepare for the Hadoop interviews.

  1. Define block and block scanner in HDFS?

A block is the minimum amount of data that HDFS can read or write; the default block size is 64MB (128MB in Hadoop 2). The block scanner runs on each DataNode and periodically verifies the stored blocks against their checksums to catch errors.
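
A quick arithmetic sketch of how a file maps to blocks, assuming the 64MB default mentioned above:

```python
import math

def num_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 200 MB file with 64 MB blocks needs 4 blocks (3 full + one 8 MB partial).
print(num_blocks(200 * 1024 * 1024))  # -> 4
```

Note that a file smaller than a block does not waste a full block on disk; only the actual bytes are stored.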

  1. Explain the name node, backup Node, and the checkpoint name node?

The NameNode manages the metadata, i.e. the directory tree of the HDFS file system, on a Hadoop cluster. The fsimage file and the edits file are the two files kept by the NameNode. The Backup Node keeps an up-to-date in-memory copy of the file system metadata, in sync with the active NameNode. The Checkpoint Node creates checkpoints at regular intervals: it downloads the edits and fsimage files from the NameNode, merges them locally, and uploads the new image back to the active NameNode.

  1. Explain the term ‘commodity hardware’ in Hadoop?

Commodity hardware is inexpensive, readily available hardware with enough RAM to run the required services; Hadoop does not require high-end hardware to be configured.

  1. Write the port number for the name node, task tracker, and the job tracker?

The default web UI port numbers are 50070 for the NameNode, 50030 for the JobTracker, and 50060 for the TaskTracker.

  1. Describe the process of inter-cluster data copying?

Copying data between Hadoop clusters is called inter-cluster data copying, and it is done with the DistCp tool, which distributes the copy work from the source to the destination. DistCp can preserve file attributes and supports update and overwrite options.

  1. Explain secondary NameNode?

The Secondary NameNode performs checkpointing in HDFS. It takes the edit logs from the NameNode and merges them with the fsimage to produce a new image. This prevents the edit logs from growing too large.

  1. What are the files associated with the metadata?

The files associated with metadata are FSImage and Editlogs.

  1. Is it a challenging task to save lots of small files in HDFS?

Yes, because the NameNode keeps the metadata of every file in RAM, and each file, directory, and block object takes roughly 150 bytes of NameNode memory. Storing a huge number of small files therefore exhausts the NameNode's memory, which limits the number of files a Hadoop cluster can hold.
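
The 150-bytes-per-object rule of thumb makes the problem easy to quantify; this is a back-of-the-envelope sketch, not an exact NameNode accounting:

```python
def namenode_memory_bytes(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode heap estimate: each file object and each block object
    costs ~150 bytes of NameNode memory (a commonly cited rule of thumb)."""
    objects = num_files + num_files * blocks_per_file  # file objects + block objects
    return objects * bytes_per_object

# Ten million small files (one block each) already need ~3 GB of NameNode heap.
print(namenode_memory_bytes(10_000_000))  # -> 3000000000
```

The same data packed into a few thousand large files would need only a few megabytes of metadata, which is why HDFS favors large files.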

  1. How to check that the NameNode is working?

The common method to check whether the NameNode is working is the jps command, which lists the status of all the Hadoop daemons running on the node.

  1. Is it possible to copy a file into HDFS with a block size different from the default?

Yes, it is possible by passing '-Ddfs.blocksize=block_size', where block_size is specified in bytes. For example, to copy a file with a 32MB block size instead of the default 128MB, use: hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/fita/test.txt /sample_hdfs, and to check the block size use: hadoop fs -stat %o /sample_hdfs/test.txt.

  1. What is the rack awareness algorithm and why is it used in Hadoop?

The rack awareness algorithm is used in Hadoop to decide where block replicas are placed. Spreading replicas across racks improves network performance, helps manage traffic, and prevents data loss when an entire rack fails.

  1. Is it possible to edit a file present in HDFS?

No, it is not possible to modify a file in HDFS because HDFS follows the write-once, read-many model.

  1. If you want to read a file which is opened for writing in HDFS what is the procedure?

Yes, it is possible to read a file that is open for writing. The hflush operation in HDFS pushes all the data in the write pipeline and waits for acknowledgments from the DataNodes, so the data written before hflush is visible to readers even though the file is still open for writing.

  1. Explain the data integrity in HDFS?

Data integrity in HDFS refers to the correctness of the stored data. Checksums verified on every read operation and the periodic block scanner together verify the correctness of the data stored in HDFS.

  1. What are the uses of context object?

The Context object holds the configuration details for the job and lets the mapper or reducer interact with the rest of the Hadoop system. It is used for updating counters, reporting progress, and providing the status of the application. Join the Hadoop Training Chennai to hone your technical skills in the Hadoop technology.

  1. What are the three methods of the Reducer?

setup(), reduce(), and cleanup() are the three methods of the Reducer.

  1. What is the function definition of the three methods of the reducer?

setup() is called once before the reduce task and is where configuration parameters such as input data size, distributed cache, and heap size are read; reduce() is called once per key and performs the actual reduce work; cleanup() is called once at the end and is used for cleaning up temporary files.
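
The lifecycle can be sketched as follows; this mirrors the setup/reduce/cleanup contract of the Java Reducer API, but the class below is only an illustrative analogy, not an actual Hadoop class:

```python
# Illustrative Reducer lifecycle: setup once, reduce per key, cleanup once.

class SumReducer:
    def setup(self):
        # Runs once per task: read configuration, open side files, etc.
        self.results = {}

    def reduce(self, key, values):
        # Runs once per key, with all of that key's values.
        self.results[key] = sum(values)

    def cleanup(self):
        # Runs once at the end: close resources, delete temporary files.
        return self.results

r = SumReducer()
r.setup()
for key, values in [("a", [1, 2]), ("b", [3])]:
    r.reduce(key, values)
print(r.cleanup())  # -> {'a': 3, 'b': 3}
```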

  1. Explain the different phases in the Map-reduce?

Shuffle, sort, and partition are the three phases between map and reduce. In the shuffle phase the map outputs are transferred to the nodes running the reducers; in the sort phase the intermediate keys are sorted on each node; and the partition phase decides which reducer each intermediate key and value is sent to.
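
The three phases above can be modeled in a few lines; this is a toy simulation of the idea, not Hadoop's actual implementation:

```python
from collections import defaultdict

def shuffle_sort_partition(map_outputs, num_reducers):
    """Toy model of the phases between map and reduce: partition each
    (key, value) pair to a reducer, then sort each reducer's input by key."""
    partitions = defaultdict(list)
    for key, value in map_outputs:
        r = hash(key) % num_reducers   # partition phase (default hash partitioner)
        partitions[r].append((key, value))
    for r in partitions:               # sort phase on each reducer's input
        partitions[r].sort()
    return dict(partitions)

pairs = [("b", 1), ("a", 1), ("b", 1)]
result = shuffle_sort_partition(pairs, 2)
# All values for the same key land in the same partition, in sorted order.
print(result)
```

The key guarantee this demonstrates is that every value for a given key reaches the same reducer, already sorted by key.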

  1. Write a custom partitioner for a Hadoop MapReduce job?

The steps to write a custom partitioner are: create a new class that extends the Partitioner class, override its getPartition method, and register the custom partitioner with the job, either through the configuration file or programmatically with the set-partitioner method in the MapReduce driver (wrapper).
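
To make the getPartition step concrete, here is a hypothetical routing rule sketched in Python (the real class would be Java, extending org.apache.hadoop.mapreduce.Partitioner; the first-letter rule below is purely illustrative):

```python
# Sketch of the getPartition logic a custom partitioner would implement:
# route keys whose first letter is a-m to one reducer and n-z to another.

def get_partition(key, value, num_partitions):
    first = key[0].lower()
    bucket = 0 if first <= "m" else 1
    return bucket % num_partitions  # must always return 0..num_partitions-1

print(get_partition("apple", 1, 2))  # -> 0
print(get_partition("zebra", 1, 2))  # -> 1
```

Whatever the rule, getPartition must be deterministic so that equal keys always reach the same reducer.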

  1. What are the two side data distribution techniques?

Job configuration and the distributed cache are the two side data distribution techniques. The job configuration is used when the serialized side data is no more than a few kilobytes, and the distributed cache is used for distributing larger files through the cache mechanism.

  1. When is HBase used in the big data application?

HBase is used in a big data application when the schema is variable, when key-based access to the data is needed, and when the data is stored in the form of collections.

  1. What are the key components of the HBase?

The key components of HBase are the Region (holds the in-memory data), the Region Server (serves and monitors the regions), the HBase Master (monitors the region servers), ZooKeeper (coordinates between the HBase Master component and the clients), and the catalog tables (store and track all the regions in the system).

  1. Mention the operational commands in HBase at a record level and table level?

Put, get, increment, scan, and delete are the record-level commands, while describe, list, drop, disable, and create are the table-level operational commands.

  1. Explain the Row key?

The Row Key is the unique identifier of each row in an HBase table. Row keys group cells logically and co-locate related rows on the same server, and internally a row key is simply a byte array.
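
Because row keys are sorted byte arrays, their design controls data locality. As a sketch, consider a hypothetical composite key of (user id, reversed timestamp) so a user's newest rows sort first; the names and layout here are illustrative, not an HBase API:

```python
import struct

MAX_LONG = 2**63 - 1

def make_row_key(user_id: str, timestamp_ms: int) -> bytes:
    # HBase stores and sorts row keys as raw byte arrays, so we build one:
    # "user_id#<big-endian reversed timestamp>" makes newer rows sort first.
    reversed_ts = MAX_LONG - timestamp_ms
    return user_id.encode("utf-8") + b"#" + struct.pack(">q", reversed_ts)

k_old = make_row_key("user42", 1_000)
k_new = make_row_key("user42", 2_000)
assert k_new < k_old  # newer events sort first for the same user
```

This is why the question above notes that row keys group cells logically: all of user42's rows are contiguous in the table.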

  1. What is the difference between RDBMS data model and the HBase data model?

HBase is a schema-less data model whereas RDBMS is schema based. HBase has automated partitioning whereas RDBMS has no support for the partitioning. HBase stores de-normalized data whereas RDBMS stores normalized data.

  1. What are the two catalog tables in the HBase?

-ROOT- and .META. are the two important catalog tables in HBase. The -ROOT- table tracks the location of the .META. table, and the .META. table stores the locations of all the regions in the system.

  1. What is the column family?

A column family is a logical division of the data. If the block size of a column family is changed, the existing data keeps the old block size while new data takes the new block size. During compaction the old data is rewritten with the new block size, so the existing data is always read correctly.

  1. List out the difference between HBase and Hive?

HBase is a NoSQL key-value store, whereas Hive is for SQL-savvy people to run MapReduce jobs. HBase is for real-time querying, whereas Hive is for analytical querying of data. HBase supports four primary operations: put, get, scan, and delete.

  1. What is the row deletion in HBase?

Data is not immediately deleted by the delete command in HBase; instead it is made invisible by setting a tombstone marker. The marked cells are actually deleted and removed during the next major compaction.

  1. List out the different types of the tombstone markers in HBase for the deletion?

Family delete marker, version delete marker, and column delete marker are the three different types of tombstone markers in HBase for deletion.

  1. What are the functions of the different tombstone markers in HBase?

The family delete marker marks all the columns of a column family, the version delete marker marks only a single version of a column, and the column delete marker marks all the versions of a column.

  1. Describe the HLog and WAL?

WAL stands for write-ahead log. HLog is the file that implements the WAL in HBase: every edit is written to the HLog immediately, before it is applied to the HStore. The HLog contains the entries of the entire region server, and every region server has one HLog.

  1. What are the hallmark features of HBase?

Schema flexibility, scalability, and high reliability are the three features of the HBase.

  1. What are the advantages of using the HBase?

The advantages of HBase are triggers in the form of coprocessors (coprocessors let you run custom code on the region server), record-level consistency, and in-built versioning.

  1. What is CAP in HBase?

CAP stands for Consistency, Availability, and Partition tolerance. HBase is a column-oriented database that provides consistency and partition tolerance.

  1. List out some of the other column-based databases?

Cassandra is another popular column-oriented database; CouchDB and MongoDB are often listed alongside it, although strictly speaking they are document-oriented stores rather than column-oriented databases.

  1. List out the 18 filters in HBase?

TimestampsFilter, PageFilter, MultipleColumnPrefixFilter, FamilyFilter, ColumnPaginationFilter, SingleColumnValueFilter, RowFilter, QualifierFilter, ColumnRangeFilter, ValueFilter, PrefixFilter, SingleColumnValueExcludeFilter, ColumnCountGetFilter, InclusiveStopFilter, DependentColumnFilter, FirstKeyOnlyFilter, KeyOnlyFilter, and RandomRowFilter are the 18 filters in HBase.

  1. List some of the commands in import and export?

Create job (--create), verify job (--list), inspect job (--show), and execute job (--exec) are some of the Sqoop job commands used with import and export.

  1. How to create a job in sqoop?

A Sqoop job imports the table data from the RDBMS to HDFS. A job named myjob is created with the command: $ sqoop job --create myjob -- import --connect jdbc:mysql://localhost/db --username root --table employee -m 1.

  1. Describe verify job (–list)?

The '--list' argument is used to verify the saved jobs, and the command is $ sqoop job --list.

  1. What is execute the job (–exec)?

The '--exec' option is the Sqoop command used to execute a saved job. For example, $ sqoop job --exec myjob runs the job that was created with --create myjob.

  1. Explain how Java is used with Sqoop?

The Sqoop jar must be included in the classpath of the Java code. The Java code then invokes the Sqoop.runTool() method, and the necessary parameters are created programmatically, just as they would be on the command line.

  1. Describe the incremental data load in Sqoop?

The delta (updated) data is the incremental data in Sqoop. Incremental import is controlled by three attributes: --incremental (the mode), --check-column (the column to examine), and --last-value (the last imported value of that column).

  1. What are the two types of incremental import using sqoop?

The two types of support for incremental import are append and lastmodified.

  1. What is the command for the standard location or path in the hadoop sqoop scripts?

/usr/bin/hadoop and /usr/bin/sqoop are the standard locations or paths of the hadoop and sqoop scripts.

  1. What is the command to check tables in a single database using Sqoop?

sqoop list-tables --connect jdbc:mysql://localhost/user is the command to check the tables in a single database using Sqoop.

  1. How are large objects handled by Sqoop in the Hadoop technology?

Sqoop handles large objects, i.e. CLOBs (character large objects) and BLOBs (binary large objects), by storing them in a separate storage file called a "LobFile".

  1. Explain how to use SQL queries with the Sqoop import command?

SQL queries are executed in the import command with the -e and --query options. The --target-dir value must also be specified in the import command.

  1. What is the difference between Sqoop and DistCp?

Sqoop is used to transfer data between Hadoop and an RDBMS, whereas DistCp is used to transfer data between Hadoop clusters.

  1. List out the limitations of importing RDBMS tables into HCatalog directly?

The --hcatalog-database option is used to import RDBMS tables into HCatalog directly. The --as-avrodatafile, --direct, --as-sequencefile, --target-dir, and --export-dir options are not supported with HCatalog.

  1. Should the data transfer utility Sqoop be placed on an edge node?

The general suggestion is not to place Sqoop on an edge node. Hadoop services on the same node communicate with each other, and these messages are important for the Hadoop services; Sqoop's high data-transfer volumes could delay them and result in the whole node being cut off from the Hadoop cluster.

  1. What are the core components in flume?

The core components of Flume are the Event, Source, Sink, Channel, Agent, and Client. An agent is any JVM process that runs Flume, and the client is the component that transmits the event to a source operating within the agent.

  1. How is data reliability achieved in Flume?

Apache Flume provides end-to-end reliability. It achieves this through a transactional approach in the data flow.

  1. How to use the Flume along with the HBase in the Hadoop technology?

Apache Flume can be used with HBase through two sinks: HBaseSink and AsyncHBaseSink. HBaseSink supports HBase version 0.96 onward and secure HBase clusters, while AsyncHBaseSink makes non-blocking calls to HBase.

  1. Explain in detail about the working of the HBaseSink?

In HBaseSink, a Flume event is converted into HBase puts or increments. The sink instantiates an implementation of HBaseEventSerializer when it starts and calls the serializer's initialize method; for each event, the serializer translates the Flume event into puts and increments, which are then sent to the HBase cluster.

  1. Explain in detail the work of the AsyncHBaseSink?

The sink calls the initialize method of the serializer, which is implemented by AsyncHBaseEventSerializer. For each event, the setEvent method is called, followed by the getIncrements and getActions methods. Similar to HBaseSink, when the sink stops, the cleanup method of the serializer is called.

  1. What are the different channel types in Flume?

MEMORY channel, JDBC channel, and FILE channel are the different channel types in Flume.

  1. Explain the MEMORY Channel in the Flume?

In the MEMORY channel, the events provided by the source are read into memory and then passed to the sink.

  1. Explain the JDBC Channel in Flume?

The events are stored in an embedded Derby database in the JDBC channel in the Flume.

  1. Which channel is the fastest among the different channels in Flume?

The MEMORY channel is the fastest, but it also carries a risk of data loss. Which channel to use depends on the nature of the big data application.

  1. Which channel is reliable in the Flume?

FILE channel is the reliable channel in the Flume.

  1. Describe the replication and multiplexing selectors in Flume?

Channel selectors handle the case where a source has multiple channels: an event can be written to a single channel or to multiple channels. The replicating selector is the default channel selector when none is specified for the source; it writes the same event to every channel in the source's channel list. The multiplexing selector routes different events to different channels based on an event header.
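
The two selector behaviors can be sketched as follows; this is a toy model of the routing logic, not the actual Flume API:

```python
# Toy model of Flume channel selectors (illustrative names and structures).

def replicating_selector(event, channels):
    # Default selector: every configured channel receives a copy of the event.
    return list(channels)

def multiplexing_selector(event, routing, default_channel):
    # Route by a header value, e.g. header "type" decides the channel.
    return [routing.get(event["headers"].get("type"), default_channel)]

event = {"headers": {"type": "error"}, "body": b"disk full"}
print(replicating_selector(event, ["c1", "c2"]))                  # -> ['c1', 'c2']
print(multiplexing_selector(event, {"error": "c2"}, "c1"))        # -> ['c2']
```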

  1. Which helps for multi-hop agent setup in Flume?

The Avro RPC Bridge mechanism helps for the multi-hop agent set up in the Apache Flume.

  1. How to leverage the real-time analysis of the big data through Flume method?

Apache Solr servers, fed through the MorphlineSolrSink, are used for data extraction and transformation to leverage real-time analysis of big data through Flume.

  1. What is the major difference between filesink and filerollsink?

The HDFS File Sink writes the events into the Hadoop distributed file system, whereas the File Roll Sink stores the events on the local file system.

  1. Is it possible to use Apache Kafka without ZooKeeper?

No, it is not possible to use Apache Kafka without ZooKeeper. If ZooKeeper is down, Kafka cannot serve client requests.

  1. List some of the companies where they use the Hadoop Zookeeper?

Yahoo, Solr, Helprace, Neo4j, and Rackspace are some of the companies and projects where ZooKeeper is used for coordination.

  1. Explain the role of ZooKeeper in the HBase architecture?

ZooKeeper is the monitoring server that provides different services in HBase. Those services are tracking server failures and network partitions, maintaining the configuration information, establishing communication between the clients and the region servers, and providing ephemeral nodes to identify the available servers in the cluster.

  1. Describe how ZooKeeper works in Kafka?

ZooKeeper is a highly distributed and scalable system. In Apache Kafka, ZooKeeper is used to store various configurations and make them available across the cluster. The configurations are distributed and replicated throughout the leader and follower nodes of the ZooKeeper ensemble. Kafka and ZooKeeper are inter-connected: if ZooKeeper is down, Kafka cannot serve client requests.

  1. How does Zookeeper work?

ZooKeeper coordinates distributed applications. It is used to store important configuration information and to facilitate updates to it. ZooKeeper is a robust, replicated synchronization service that coordinates the processes of distributed applications. A ZooKeeper cluster (ensemble) is formed from three or more independent servers, and the ensemble elects a leader node. Writes are linearized through the leader, while reads can be served concurrently by any server.

  1. Explain in detail about the command line interface in Zookeeper?

ZooKeeper provides a command line client for interactive use. After connecting, the prompt lets the user navigate the znodes, which behave much like directories, and the user can simply enter commands at the prompt to view and change them.

  1. What are the two types of Znodes?

The two types of Znodes are Ephemeral and sequential znodes.

  1. How do the two types of znodes work?

Ephemeral znodes are destroyed as soon as the client that created them disconnects. A sequential znode is one whose name is given by the client and then suffixed with a sequence number chosen by ZooKeeper.

  1. How to track events in Zookeeper?

The watch is the event system of ZooKeeper: a watch set on a znode triggers an event whenever that znode is removed or altered, or when new children are created below it, so clients can track znodes even across disconnections.

  1. What types of problems are addressed by using the Zookeeper in Hadoop?

Apache ZooKeeper solves two major problems: synchronizing access to shared data and communicating information between processes. Writing your own coordination protocol for a Hadoop cluster is failure-prone and frustrating for the developer, so Apache ZooKeeper is used instead as a ready-made coordination service for distributed applications.

  1. What are the different modes of execution in Apache Pig?

Local mode and MapReduce mode are the two modes of execution in Apache Pig.

  1. Explain the different modes of execution in Apache Pig?

Local mode requires access only to a single machine, and all the files are installed and executed on the local host. In MapReduce mode, Pig accesses a Hadoop cluster and the data stored in HDFS.

  1. What is a co-group operator in pig?

The COGROUP operator works on multiple tuples and is applied to statements that involve two or more relations. Pig first groups both tables and then joins the tables on the grouped columns. COGROUP can be applied to up to 127 relations.

  1. Describe the SMB join in Hive?

In a Sort-Merge-Bucket (SMB) join, the mapper reads a bucket from the first table and the corresponding bucket from the second table and then performs a merge-sort join. SMB join is used in Hive when the tables are large, and it places no limit on file, partition, or table size: the sorted, bucketed columns are merged and the tables are joined.

  1. How to connect an application in the Hive as a server?

The three ways to connect the Hive server are ODBC driver, JDBC driver, and thrift client.

  1. Explain in detail how an application is connected to the Hive?

The ODBC driver uses the ODBC protocol, the JDBC driver uses the JDBC protocol, and the Thrift client is used to make calls to all Hive commands from different programming languages such as PHP, Python, Java, C++, and Ruby.

  1. Explain what the OVERWRITE keyword denotes in the Hive LOAD statement?

The OVERWRITE keyword deletes the contents of the target table and replaces them with the files referred to by the file path. Without the OVERWRITE keyword, the files are simply added to the table's existing data.

  1. Explain SerDe in Hive?

SerDe stands for Serializer/Deserializer, and Hive uses a SerDe to read and write data from tables. Users who only want to read their own data format often write just a deserializer instead of a full SerDe. Rather than writing a SerDe from scratch, the protocol-based DynamicSerDe can be used to parameterize the columns and the column types.
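
The SerDe contract can be pictured as a pair of inverse functions: deserialize turns a raw stored record into a row Hive can operate on, and serialize does the reverse. This toy CSV example is only an analogy (real SerDes are Java classes):

```python
# Toy serializer/deserializer illustrating the SerDe contract.

class CsvSerDe:
    def __init__(self, columns):
        self.columns = columns

    def deserialize(self, raw_line: str) -> dict:
        # Raw storage format -> row object the engine can query.
        return dict(zip(self.columns, raw_line.rstrip("\n").split(",")))

    def serialize(self, row: dict) -> str:
        # Row object -> raw storage format.
        return ",".join(row[c] for c in self.columns)

serde = CsvSerDe(["id", "name"])
row = serde.deserialize("7,alice\n")
assert row == {"id": "7", "name": "alice"}
assert serde.serialize(row) == "7,alice"  # round-trips back to the raw form
```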

  1. Mention the stable versions of Hadoop?

Release 2.7.1 (stable), Release 2.4.1, and Release 1.2.1 (stable) are the stable versions of Hadoop.

  1. What is YARN in Apache Hadoop?

YARN is a large-scale distributed resource-management system, suitable for running big data applications in Hadoop 2.0.

  1. What is the difference between YARN and Map Reduce?

YARN is a more powerful and efficient technology than classic MapReduce, and it is referred to as Hadoop 2.0 or MapReduce 2.

  1. What are the benefits of YARN?

Unlike Hadoop 1.x, YARN has no fixed slots for resource utilization: the same container can run either map or reduce tasks, so resource utilization is better. YARN can also run applications that are not based on the MapReduce model, and it is backwards compatible with existing MapReduce jobs.

  1. Explain about the two ways of including native libraries in YARN jobs?

The two ways are setting -Djava.library.path on the command line, which can lead to errors in some cases, and setting the LD_LIBRARY_PATH variable in the .bashrc file.

  1. List out the differences between the Hadoop1.x and Hadoop 2.X?
  • In Hadoop 1.x, MapReduce is responsible for both processing and cluster management, whereas in Hadoop 2.x the processing is done by processing models and the cluster management is taken over by YARN.
  • Scaling is higher: Hadoop 2.x scales up to about 10,000 nodes per cluster.
  • A NameNode failure in Hadoop 1.x must be recovered manually, whereas Hadoop 2.x overcomes this single point of failure (SPOF) with automatic NameNode recovery.
  • Hadoop 1.x works on the concept of fixed map and reduce slots, whereas Hadoop 2.x works with containers and can also run generic tasks.
  1. What are the improvements in Hadoop 2.0 when compared to Hadoop 1.0?

Hadoop 2.x is better at resource management and execution: the separation of the processing logic from resource management allows the resources to be shared among multiple parallel processing frameworks, such as Impala, alongside the core MapReduce component. Hadoop 2.x also has better cluster utilization and helps applications scale to a large number of jobs.

  1. Name the different types of modules in Apache Hadoop 2.0 framework?

Hadoop 2.0 contains the four important modules such as Hadoop common, HDFS, MapReduce and YARN.

  1. How is the distance between two nodes calculated in Hadoop?

Bandwidth is difficult to measure in Hadoop, so the network is represented as a tree and the distance between two nodes is their distance in that tree. The getDistance() method calculates the distance between two nodes, with the assumption that the distance from a node to its parent is 1.
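
The tree-distance idea can be sketched directly: each node has a network path such as /rack1/node1, and the distance is the number of hops up to the closest common ancestor and back down. The paths below are illustrative:

```python
# Sketch of Hadoop's network-topology distance (parent-child hop = 1).

def distance(path_a: str, path_b: str) -> int:
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):          # count shared ancestors from the root
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(distance("/rack1/node1", "/rack1/node1"))  # same node       -> 0
print(distance("/rack1/node1", "/rack1/node2"))  # same rack       -> 2
print(distance("/rack1/node1", "/rack2/node3"))  # different racks -> 4
```

These distances are what the rack awareness algorithm uses to prefer same-node, then same-rack, then off-rack replicas.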

  1. How to test the quality of the data?

The data obtained from various sources is used by vendors and customers, so the stored data is cleaned using rules and properties such as the conformity, perfection, repetition, reliability, validity, and completeness of the data.

We have framed these questions after scrutinizing the questions asked repeatedly in interviews over the past few years. These answers will help students appear for the interview with full confidence, as all parts of Hadoop have been covered.
