Story Source and Credit: Ahsan Hadi & Ibrar Ahmed, Feb 20, 2015
Advances in recent Postgres releases have opened up opportunities for expanding features that support data integration. The most compelling of these new features are Foreign Data Wrappers (FDWs). FDWs essentially act as a pipeline that moves data between Postgres and other kinds of databases, as if the different solutions were a single database. EnterpriseDB has developed new FDWs for MongoDB and MySQL. Now, we have released one for Hadoop.
The FDW for Hadoop can be found on EDB’s GitHub page.
Foreign Data Wrappers enable Postgres to act as the central hub, a federated database, in the enterprise. And the features have emerged at a good time, as integration becomes more of a challenge. Database administrators (DBAs) have begun using more and more specialized NoSQL-only solutions to address specific problems. Now DBAs need to get the data in all of these different solutions to tell a single story, or to lend value to one another.
Since EDB released the new FDWs with both read and write capability to the open source PostgreSQL community, usage has grown tremendously. More and more developers are contributing features and fixes.
Now that we’ve established some context for the FDWs, let’s explore the Hadoop FDW in more detail. There are a few popular terms worth exploring first:
Hadoop Distributed File System (HDFS): HDFS is a distributed file system for large data storage and processing. It’s Java-based and provides scalable, reliable storage designed to span large clusters of commodity servers. The data is stored in flat files with no imposed format. As a result, it’s used for very large data sets, and a typical HDFS file is gigabytes to terabytes in size. Applications that run on HDFS require streaming access to their data sets.
MapReduce: MapReduce is a programming model and associated implementation for processing and generating large data sets. The model was inspired by the map and reduce functions commonly used in functional programming. MapReduce jobs are typically written in Java and are used for performing statistical analysis, aggregation or other complex processing on large data sets stored in HDFS.
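To make the model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using the classic word count. It is written in Python for brevity rather than as a real Hadoop job, and the function names (map_phase, shuffle, reduce_phase, word_count) are our own illustrative choices, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data in HDFS"]))
```

In a real cluster, the map and reduce steps run in parallel on many machines and the shuffle moves data across the network. The sketch only shows how the three steps fit together.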
Hive Server: Apache Hive is data warehouse software that facilitates querying and managing large data sets residing in distributed storage such as HDFS. Hive defines a simple SQL-like query language called QL, which is used for querying and manipulating large data sets stored in HDFS. QL is similar to SQL and provides similar constructs for retrieving data. Hive server is an optional service that allows remote clients to send requests to Hive using various programming languages and retrieve results.
Foreign Data Wrapper (FDW): While we have introduced FDWs already, it’s good to know they are based on the Postgres implementation of SQL/MED (SQL Management of External Data), which is part of the SQL standard. It is a standard way of accessing external data stores ranging from SQL and NoSQL-only databases to flat files. FDWs provide a SQL interface for accessing remote objects and large data objects stored in remote data stores. The FDWs supported by EnterpriseDB are postgres_fdw, oracle_fdw, mongo_fdw and mysql_fdw, and now we’re adding hdfs_fdw to the list.
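As a rough illustration of what that SQL interface looks like in practice, the sketch below uses Python to assemble the usual sequence of SQL/MED statements: create the extension, define a foreign server, map a user, and declare a foreign table. The statement forms are standard Postgres DDL. Everything else is an assumption made for illustration: the server name, table name, columns, and the option names (host, port, dbname, table_name). Check the hdfs_fdw README for the options your version actually accepts.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def options_clause(options):
    """Render a dict as an OPTIONS (...) clause, e.g. OPTIONS (host 'x')."""
    rendered = ", ".join(f"{name} {sql_literal(value)}" for name, value in options.items())
    return f"OPTIONS ({rendered})"


def foreign_table_ddl(server, table, columns, server_options, table_options):
    """Return the DDL statements that expose one remote table through an FDW.

    Identifiers are inserted as-is (not quoted), so pass only trusted names.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return [
        "CREATE EXTENSION hdfs_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER hdfs_fdw {options_clause(server_options)};",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server};",
        f"CREATE FOREIGN TABLE {table} ({column_list}) SERVER {server} {options_clause(table_options)};",
    ]


if __name__ == "__main__":
    # All names and option values below are hypothetical placeholders.
    statements = foreign_table_ddl(
        server="hdfs_server",
        table="weblogs",
        columns=[("client_ip", "text"), ("hits", "integer")],
        server_options={"host": "localhost", "port": 10000},
        table_options={"dbname": "default", "table_name": "weblogs"},
    )
    print("\n".join(statements))
```

Once statements like these have been run in Postgres, the foreign table can be queried with an ordinary SELECT, and the wrapper takes care of fetching the rows from the remote store.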
Full Story, Installation Instructions and Code Examples Here…