Main content: connecting Flink 1.10.0 to Kafka, implemented in Scala.

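The previous article, "Flink: DataStream API", covered the many data sources built into the DataStream API, such as files, directories, sockets, and collections. These are rarely used in production. In real deployments, data is usually read in real time from a streaming source such as Kafka. This chapter uses the Apache Kafka Connector to introduce the Kafka Source and the Kafka Sink.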

Kafka Connector

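Flink ships with a Kafka connector that can both produce and consume Kafka data. Importantly, the Flink Kafka Consumer integrates with Flink's checkpointing mechanism to provide exactly-once processing semantics. To achieve this, Flink does not rely solely on the offsets committed to Kafka; it also tracks and checkpoints these offsets internally.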
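The idea of tracking offsets inside checkpointed state can be illustrated with a toy model. This is only a sketch in plain Scala, not Flink's implementation: `ToyConsumer`, `advance`, `checkpoint`, and `restore` are hypothetical names invented for this example.

```scala
// Toy model only: NOT Flink's implementation. It illustrates that offsets
// are kept in checkpointed state rather than only in Kafka.
object OffsetCheckpointSketch {
  // partition id -> next offset to read
  type Offsets = Map[Int, Long]

  final class ToyConsumer(start: Offsets) {
    private var current: Offsets = start
    private var lastCheckpoint: Offsets = start

    /** Record that one message was read from `partition`. */
    def advance(partition: Int): Unit =
      current = current.updated(partition, current.getOrElse(partition, 0L) + 1)

    /** Snapshot the current offsets, as a checkpoint would. */
    def checkpoint(): Unit = lastCheckpoint = current

    /** Simulate failure recovery: rewind to the last snapshot. */
    def restore(): Unit = current = lastCheckpoint

    /** Offsets the consumer would read from next. */
    def offsets: Offsets = current
  }
}
```

In this model, a record read after the last checkpoint is read again after `restore()`. Because the rest of the job state is rewound to the same checkpoint, each record is reflected in the state only once.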

The table below maps Kafka versions to the corresponding Flink Kafka connector and its consumer and producer classes:

Maven Dependency | Supported since | Consumer and Producer Class name | Kafka version
flink-connector-kafka-0.8_2.11 | 1.0.0 | FlinkKafkaConsumer08, FlinkKafkaProducer08 | 0.8.x
flink-connector-kafka-0.9_2.11 | 1.0.0 | FlinkKafkaConsumer09, FlinkKafkaProducer09 | 0.9.x
flink-connector-kafka-0.10_2.11 | 1.2.0 | FlinkKafkaConsumer10, FlinkKafkaProducer10 | 0.10.x
flink-connector-kafka-0.11_2.11 | 1.4.0 | FlinkKafkaConsumer11, FlinkKafkaProducer11 | 0.11.x
flink-connector-kafka_2.11 | 1.7.0 | FlinkKafkaConsumer, FlinkKafkaProducer | >= 1.0.0

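The Kafka consumer class is named after the Kafka version it targets: FlinkKafkaConsumer08, FlinkKafkaConsumer09, and so on. For Kafka >= 1.0.0 it is simply FlinkKafkaConsumer. In addition, starting with Flink 1.9.0 the connector uses the Kafka 2.2.0 client.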

First, add the dependency:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.10.0</version>
</dependency>

Compatibility:

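Since Flink 1.7, the universal Kafka connector (flink-connector-kafka) no longer tracks a specific major Kafka version. If your Kafka broker is version 1.0.0 or newer, use this connector. If you run an older Kafka version (0.11, 0.10, 0.9, or 0.8), use the connector that matches the broker version.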
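The compatibility rule can be written down as a small lookup. The following is a minimal sketch in plain Scala that encodes only the table above; `KafkaConnectorChooser` and `artifactFor` are hypothetical names, not part of Flink.

```scala
object KafkaConnectorChooser {
  /**
   * Returns the connector artifactId (Scala 2.11 build) for a Kafka broker
   * version string such as "0.10.2" or "2.2.0", following the table above.
   */
  def artifactFor(brokerVersion: String): String = {
    val parts = brokerVersion.split("\\.").map(_.toInt)
    require(parts.length >= 2, s"unexpected version: $brokerVersion")
    (parts(0), parts(1)) match {
      case (0, 8)  => "flink-connector-kafka-0.8_2.11"
      case (0, 9)  => "flink-connector-kafka-0.9_2.11"
      case (0, 10) => "flink-connector-kafka-0.10_2.11"
      case (0, 11) => "flink-connector-kafka-0.11_2.11"
      // Brokers at 1.0.0 or newer use the universal connector
      case (major, _) if major >= 1 => "flink-connector-kafka_2.11"
      case _ =>
        throw new IllegalArgumentException(s"no connector listed for Kafka $brokerVersion")
    }
  }
}
```

For example, `KafkaConnectorChooser.artifactFor("2.2.0")` yields the universal connector used in the rest of this article.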

Complete code examples:

KafkaSource

package org.ourhome.streamapi

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaConsumerBase}

/**
 * @Author Do
 * @Date 2020/4/14 23:25
 */
object KafkaSource {
  private val KAFKA_TOPIC: String = "kafka_producer_test"
  def main(args: Array[String]): Unit = {
    val params: ParameterTool = ParameterTool.fromArgs(args)
    val runType:String = params.get("runtype")
    println("runType: " + runType)

    val properties: Properties = new Properties()
    // Placeholder: replace with the real broker address in host:port form
    properties.setProperty("bootstrap.servers", "ip:port")
    properties.setProperty("group.id", "kafka_consumer")

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
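    // EXACTLY_ONCE checkpointing mode: after recovery, each record is reflected in Flink state exactly once
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Enable checkpointing with a 5 s interval
    env.enableCheckpointing(5000) // checkpoint every 5000 msecs
    // Set the state backend and the location where state data is stored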
    env.setStateBackend(new FsStateBackend("file:///D:/Temp/checkpoint/flink/KafkaSource"))


    val dataSource: FlinkKafkaConsumerBase[String] = new FlinkKafkaConsumer(
      KAFKA_TOPIC,
      new SimpleStringSchema(),
      properties)
      .setStartFromLatest()  // start consuming from the latest offset

    env.addSource(dataSource)
      .flatMap(_.toLowerCase.split(" "))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .filter(_._2 > 5)
      .print()
      .setParallelism(1)

    // execute program
    env.execute("Flink Streaming—————KafkaSource")
  }

}
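To see what the pipeline computes for a single 5-second window, the same steps can be reproduced on an in-memory collection. This is a sketch in plain Scala with no Flink dependency; `WindowWordCountSketch` and `countWindow` are hypothetical names, and the input lines stand in for the Kafka messages that arrive within one window.

```scala
object WindowWordCountSketch {
  /**
   * Counts the words in the lines of one window and keeps only the words
   * that occur more than `threshold` times (the job above uses 5).
   */
  def countWindow(lines: Seq[String], threshold: Int = 5): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split(" "))                             // same tokenization as the job
      .groupBy(identity)                                             // stands in for keyBy(0)
      .map { case (word, occurrences) => word -> occurrences.size }  // stands in for sum(1)
      .filter { case (_, count) => count > threshold }               // same filter as the job
}
```

Calling `WindowWordCountSketch.countWindow(Seq("a a a", "A a a b"))` keeps only the word "a", which occurs six times once the input is lowercased. In the real job the counts reset with every window, so a word must exceed the threshold within a single 5-second window to be printed.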

KafkaSink

package org.ourhome.streamapi

import java.util.Properties

import org.apache.flink.api.common.serialization.{SimpleStringSchema}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaConsumerBase, FlinkKafkaProducer}

/**
 * @Author Do
 * @Date 2020/4/15 23:22
 */
object WriteIntoKafka {
  private val KAFKA_TOPIC: String = "kafka_producer_test"

  def main(args: Array[String]): Unit = {
    val params: ParameterTool = ParameterTool.fromArgs(args)
    val runType:String = params.get("runtype")
    println("runType: " + runType)

    val properties: Properties = new Properties()
    // Placeholder: replace with the real broker address in host:port form
    properties.setProperty("bootstrap.servers", "ip:port")
    properties.setProperty("group.id", "kafka_consumer")

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
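    // EXACTLY_ONCE checkpointing mode: after recovery, each record is reflected in Flink state exactly once
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Enable checkpointing with a 5 s interval
    env.enableCheckpointing(5000) // checkpoint every 5000 msecs
    // Set the state backend and the location where state data is stored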
    env.setStateBackend(new FsStateBackend("file:///D:/Temp/checkpoint/flink/KafkaSource"))

    val dataSource: FlinkKafkaConsumerBase[String] = new FlinkKafkaConsumer(
      KAFKA_TOPIC,
      new SimpleStringSchema(),
      properties)
      .setStartFromLatest()  // start consuming from the latest offset

    val dataStream: DataStream[String] = env.addSource(dataSource)
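    // "brokerList" and "topic" are placeholders: replace them with the real
    // broker addresses (comma-separated host:port) and the target topic
    val kafkaSink: FlinkKafkaProducer[String] = new FlinkKafkaProducer[String](
      "brokerList",
      "topic",
      new SimpleStringSchema()
    )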
    dataStream.addSink(kafkaSink)

    // execute program
    env.execute("Flink Streaming—————KafkaSource and KafkaSink")
  }
}
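The producer constructor used above does not, as far as I know, write to Kafka transactionally; in Flink 1.10 it defaults to at-least-once delivery. For exactly-once on the sink side, the producer has to be created with FlinkKafkaProducer.Semantic.EXACTLY_ONCE and a suitable transaction timeout. The sketch below builds only the producer properties in plain Scala; `ExactlyOnceProducerProps`, `build`, and the 15-minute default are assumptions for illustration and should be adapted to the actual broker settings.

```scala
import java.util.Properties

object ExactlyOnceProducerProps {
  /**
   * Builds producer properties for a transactional Kafka sink.
   * `bootstrapServers` is a placeholder supplied by the caller. The default
   * timeout of 15 minutes is an assumption: it matches Kafka's default
   * broker-side limit transaction.max.timeout.ms.
   */
  def build(bootstrapServers: String,
            transactionTimeoutMs: Long = 15 * 60 * 1000L): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    // Must not exceed the broker's transaction.max.timeout.ms
    props.setProperty("transaction.timeout.ms", transactionTimeoutMs.toString)
    props
  }
}
```

These properties would then be passed to the FlinkKafkaProducer constructor that takes a default topic, a KafkaSerializationSchema, the Properties, and the Semantic value. The timeout matters because FlinkKafkaProducer sets transaction.timeout.ms to 1 hour by default, which exceeds the brokers' default limit of 15 minutes; either lower the producer value as shown here or raise the broker limit.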