Impala is the industry's first native real-time SQL query engine for Apache Hadoop and the newest component of CDH. Impala completely changes the way organizations can benefit from Hadoop. Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala accelerates data processing workloads: data pipelines complete in seconds instead of minutes or hours, meeting tighter service-level agreement (SLA) specifications. It enables interactive business intelligence with popular tools, opening up real-time access to big data for every analyst in the organization without any special training, which significantly lowers the adoption risk of a big data project and accelerates return on investment (ROI). It also reduces the overall cost of data management: instead of replicating large amounts of data to a relational database to get interactive SQL performance, Cloudera customers can obtain the same experience without added cost or complexity. Impala is meant to be good at what Hive is bad at, i.e., fast-response queries, while also being good at what Hive is good at. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security, and resource management frameworks used by MapReduce, Apache Hive, Apache Pig, and other Hadoop software. Analysts and data scientists use Impala to perform analytics on data stored in Hadoop via SQL or business intelligence tools. The result is that large-scale data processing and interactive queries can be done on the same system using the same data and metadata, removing the need to migrate data sets into specialized systems or proprietary formats simply to perform analysis.
Before Impala, if your relational database was at capacity, you may have had no choice but to expand that system to maintain your expectations of performance. If you were using Hadoop to affordably analyze any amount or kind of data but wanted interactive performance, you had to move that data into a fast relational database. You then had to accept the cost and effort of duplicate storage and data synchronization; accept the rigidity of requiring fixed schemas; accept that when you moved and transformed data you would inevitably leave something behind; and accept that your analysis options would be limited in that target database. With Impala, you now have a choice. As a native component of the Hadoop ecosystem, Impala combines all of the benefits of other Hadoop frameworks, including flexibility, scalability, and cost-effectiveness, with the performance, usability, and SQL functionality necessary for an enterprise-grade analytic database. Impala was specifically targeted for integration with standard business intelligence environments, and to that end supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished with Kerberos or LDAP; authorization follows the standard SQL roles and privileges model. To query HDFS-resident data, the user creates tables via the familiar CREATE TABLE statement, which, in addition to providing the logical schema of the data, also indicates the physical layout, such as file format(s) and placement within the HDFS directory structure. Those tables can then be queried with standard SQL syntax.

WORKING OF IMPALA WITH HIVE

Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it can fit in flexible ways into your ETL and ELT pipelines.
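As a sketch of the workflow just described (the table name, columns, and HDFS path below are hypothetical, for illustration only), a single CREATE TABLE statement declares both the logical schema and the physical layout of HDFS-resident data:

```sql
-- Hypothetical example: declare an external table over files already in HDFS.
-- STORED AS gives the file format; LOCATION gives the HDFS directory.
CREATE EXTERNAL TABLE web_logs (
  event_time TIMESTAMP,
  user_id    BIGINT,
  url        STRING,
  status     INT
)
STORED AS PARQUET
LOCATION '/data/web_logs';

-- The table can then be queried with standard SQL syntax:
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status
ORDER BY hits DESC;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the underlying HDFS files are left in place, which fits the no-data-movement philosophy described above.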
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of the existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries. In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs, as shown in Fig. 1. The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or SequenceFile file formats, you load the data using Hive. The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive.
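The division of labor above can be sketched as follows (the table name is hypothetical; the statements themselves are standard Hive and Impala SQL, assuming data was loaded into an Avro-format table through Hive):

```sql
-- In Hive: write into a format Impala can read but not write, e.g. Avro.
--   INSERT INTO avro_events SELECT ... ;

-- In Impala: refresh the shared metastore view, query, and gather the
-- table and column statistics the optimizer uses.
INVALIDATE METADATA avro_events;
SELECT COUNT(*) FROM avro_events;
COMPUTE STATS avro_events;
```

COMPUTE STATS is Impala's own statement for gathering both table and column statistics in one pass; it supersedes the Hive ANALYZE TABLE workflow mentioned above for tables queried through Impala.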