Build an ETL Pipeline With Kafka Connect via JDBC Connectors

This article is an in-depth tutorial for using Kafka to move data from PostgreSQL to Hadoop HDFS via JDBC connections.


Tutorial: Discover how to build a pipeline with Kafka, leveraging the DataDirect PostgreSQL JDBC driver, to move data from PostgreSQL to HDFS. Let's go streaming!

Apache Kafka is an open-source distributed streaming platform that enables you to build streaming data pipelines between different applications. You can also build real-time streaming applications that react to streams of data; Kafka focuses on providing a scalable, high-throughput, low-latency platform for working with data streams.

Earlier this year, Apache Kafka announced a new tool called Kafka Connect, which helps users easily move datasets in and out of Kafka using connectors, and it supports JDBC connectors out of the box! One of the major benefits for DataDirect customers is that you can now easily build an ETL pipeline with Kafka leveraging your DataDirect JDBC drivers. You can connect and pull data from your data sources into Kafka, then export it from there to another data source.

[Image: Apache Kafka, from https://kafka.apache.org/]

Environment Setup

Before proceeding any further with this tutorial, make sure that you have installed and properly configured everything listed below. This tutorial assumes you are working on Ubuntu 16.04 LTS and already have PostgreSQL, Apache Hadoop, and Hive installed.
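
As a quick, hedged sanity check (the exact commands depend on how each component was installed; adjust paths if your setup differs), the following shell commands confirm that the prerequisites are on your PATH before you continue:

   # Hedged sanity check -- assumes each component is already on the PATH.
   java -version     # JDK, needed by Kafka, Hadoop, and the driver installer
   psql --version    # PostgreSQL client/server
   hadoop version    # Hadoop (HDFS + YARN)
   hive --version    # Hive, used by the HDFS connector's Hive integration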

  1. Installing Apache Kafka and required tools. To make the installation process easier for people trying this out for the first time, we will be installing Confluent Platform. This takes care of installing Apache Kafka, Schema Registry, and Kafka Connect, which includes connectors for moving files, JDBC connectors, and the HDFS connector for Hadoop.
    1. To begin with, install Confluent’s public key by running the command: wget -qO - | sudo apt-key add -
    2. Now add the repository to your sources.list by running the following command: sudo add-apt-repository "deb  stable main"
    3. Update your package lists and then install the Confluent Platform by running the following commands:
       sudo apt-get update
       sudo apt-get install confluent-platform-2.11.7
  2. Install DataDirect PostgreSQL JDBC driver
    1. Download DataDirect PostgreSQL JDBC driver by visiting .
    2. Install the PostgreSQL JDBC driver by running the following command: java -jar PROGRESS_DATADIRECT_JDBC_POSTGRESQL_ALL.jar
    3. Follow the instructions on the screen to install the driver successfully (you can install the driver in evaluation mode, where you can try it for 15 days, or in license mode if you have bought the driver).
  3. Configuring data sources for Kafka Connect
    1. Create a new file called postgres.properties, paste the following configuration, and save the file. To learn more about the modes used in this configuration, see the Kafka Connect JDBC source connector documentation. (A hedged sketch of a source table compatible with this configuration is shown just after this list.)
       name=test-postgres-jdbc
       connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
       tasks.max=1
       connection.url=jdbc:datadirect:postgresql://<server>:<port>;User=<user>;Password=<password>;Database=<dbname>
       mode=timestamp+incrementing
       incrementing.column.name=<id>
       timestamp.column.name=<modifiedtimestamp>
       topic.prefix=test_jdbc_
       table.whitelist=actor
    2. Create another file called hdfs.properties, paste the following configuration, and save the file. To learn more about the HDFS connector and the configuration options used here, see the HDFS connector documentation.
       name=hdfs-sink
       connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
       tasks.max=1
       topics=test_jdbc_actor
       hdfs.url=hdfs://<server>:<port>
       flush.size=2
       hive.metastore.uris=thrift://<server>:<port>
       hive.integration=true
       schema.compatibility=BACKWARD
    3. Note that postgres.properties and hdfs.properties essentially hold the connection details and the behavior of the JDBC source and HDFS sink connectors.
    4. Create a symbolic link for DataDirect Postgres JDBC driver in Hive lib folder by using the following command: ln -s /path/to/datadirect/lib/postgresql.jar /path/to/hive/lib/postgresql.jar
    5. Also make the DataDirect Postgres JDBC driver available on the Kafka Connect process's CLASSPATH by running the following command: export CLASSPATH=/path/to/datadirect/lib/postgresql.jar
    6. Start the Hadoop cluster by running the following commands:
       cd /path/to/hadoop/sbin
       ./start-dfs.sh
       ./start-yarn.sh
  4. Configuring and running Kafka services
    1. Download the configuration files for the Zookeeper, Kafka, and Schema Registry services.
    2. Start the Zookeeper service by providing the zookeeper.properties file path as a parameter, using the command: zookeeper-server-start /path/to/zookeeper.properties
    3. Start the Kafka service by providing the server.properties file path as a parameter, using the command: kafka-server-start /path/to/server.properties
    4. Start the Schema Registry service by providing the schema-registry.properties file path as a parameter, using the command: schema-registry-start /path/to/schema-registry.properties (a quick sanity check that all three services are up is sketched just after this list)
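
Before moving on, note what the timestamp+incrementing mode in postgres.properties expects from the source table: a strictly incrementing key column and a last-modified timestamp column. As a minimal, hypothetical sketch (the table name actor comes from table.whitelist above, but the column names id and modifiedtimestamp are placeholders you should replace with your own), such a table could be created like this:

   # Hypothetical sketch only -- substitute your own database, user, and column names.
   psql -d <dbname> -U <user> -c "
     CREATE TABLE actor (
       id                BIGSERIAL PRIMARY KEY,            -- matches incrementing.column.name
       first_name        VARCHAR(45) NOT NULL,
       last_name         VARCHAR(45) NOT NULL,
       modifiedtimestamp TIMESTAMP NOT NULL DEFAULT now()  -- matches timestamp.column.name
     );
     INSERT INTO actor (first_name, last_name) VALUES ('Alice', 'Smith');
   "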
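
Once Zookeeper, Kafka, and the Schema Registry are running, a quick, hedged sanity check (assuming the default ports 2181 for Zookeeper and 8081 for the Schema Registry; adjust to match your configuration files) confirms all three services are reachable before you start Kafka Connect:

   # Assumes default local ports -- adjust hosts/ports to match your *.properties files.
   kafka-topics --zookeeper localhost:2181 --list    # should return (possibly empty) without errors
   curl -s http://localhost:8081/subjects            # Schema Registry should answer with a JSON array, e.g. []
   jps | grep -E 'Kafka|QuorumPeerMain'              # Kafka broker and Zookeeper JVMs should be listed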

Ingesting Data Into HDFS using Kafka Connect

To start ingesting data from PostgreSQL, the final thing that you have to do is start Kafka Connect. You can start Kafka Connect by running the following command:

connect-standalone /path/to/connect-avro-standalone.properties \
    /path/to/postgres.properties /path/to/hdfs.properties

This will import the data from PostgreSQL into Kafka using the DataDirect PostgreSQL JDBC driver and create a topic named test_jdbc_actor. The data is then exported from Kafka to HDFS by reading the test_jdbc_actor topic through the HDFS connector. The data also stays in Kafka, so you can reuse it to export to any other data source.
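
To verify the end-to-end flow, you can check both sides of the pipeline. This is a hedged sketch: it assumes the HDFS connector's default topics.dir (so data lands under /topics) and that Hive integration created a table named after the topic; adjust paths and table names if you changed those settings.

   # Hedged verification -- paths assume the HDFS connector's default topics.dir.
   hdfs dfs -ls /topics/test_jdbc_actor               # Avro files written by the HDFS sink connector
   hive -e 'SELECT * FROM test_jdbc_actor LIMIT 5;'   # Hive table created via hive.integration=true

If you insert or update rows in PostgreSQL so that the incrementing id or timestamp column advances, the source connector picks the changes up on its next poll and they flow through to HDFS.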

Next Steps

We hope this tutorial helped you understand how to build a simple ETL pipeline using Kafka Connect leveraging DataDirect JDBC drivers. This tutorial is not limited to PostgreSQL: you can create ETL pipelines with any of the DataDirect JDBC drivers for relational databases, cloud sources, or big data sources by following similar steps. Also, subscribe to our blog for more tutorials.


Reposted from: https://www.cnblogs.com/felixzh/p/6035517.html
