Flink: Integrating Flink 1.10.0 with Kafka (KafkaSource and KafkaSink)
Scope: connecting Flink 1.10.0 to Kafka, implemented in Scala.
The previous article, Flink: DataStream API, introduced the data sources built into the DataStream API, such as files, directories, sockets, and collections, but those see little use in production. In real-world scenarios, data is mostly read in real time from a streaming source such as Kafka. This article uses the Apache Kafka Connector to walk through the Kafka Source and the Kafka Sink.
Kafka Connector
Flink ships with a Kafka connector for producing and consuming Kafka data. Importantly, the Flink Kafka Consumer integrates with Flink's checkpointing mechanism and can therefore provide exactly-once processing guarantees. Flink does not rely solely on Kafka's committed offsets; it tracks and checkpoints the offsets internally.
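Because the offsets live in Flink's checkpointed state, the consumer's start position can be chosen independently of the offsets committed to Kafka. Below is a minimal sketch of the options exposed by FlinkKafkaConsumerBase (the topic name and properties are placeholders); note that when a job is restored from a checkpoint, the offsets in the checkpoint take precedence over all of these settings:
val consumer = new FlinkKafkaConsumer[String]("my_topic", new SimpleStringSchema(), properties)
// Only the last call before addSource takes effect:
consumer.setStartFromGroupOffsets()            // default: resume from offsets committed for group.id
consumer.setStartFromEarliest()                // ignore committed offsets, read the topic from the beginning
consumer.setStartFromLatest()                  // ignore committed offsets, read only newly arriving records
consumer.setStartFromTimestamp(1586876400000L) // first record with timestamp >= the given epoch millis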
The table below maps Kafka versions to the matching Flink Kafka Consumer:
Maven Dependency | Supported since | Consumer and Producer Class name | Kafka version |
---|---|---|---|
flink-connector-kafka-0.8_2.11 | 1.0.0 | FlinkKafkaConsumer08 / FlinkKafkaProducer08 | 0.8.x |
flink-connector-kafka-0.9_2.11 | 1.0.0 | FlinkKafkaConsumer09 / FlinkKafkaProducer09 | 0.9.x |
flink-connector-kafka-0.10_2.11 | 1.2.0 | FlinkKafkaConsumer010 / FlinkKafkaProducer010 | 0.10.x |
flink-connector-kafka-0.11_2.11 | 1.4.0 | FlinkKafkaConsumer011 / FlinkKafkaProducer011 | 0.11.x |
flink-connector-kafka_2.11 | 1.7.0 | FlinkKafkaConsumer / FlinkKafkaProducer | >= 1.0.0 |
The consumer classes are named by version: FlinkKafkaConsumer08, FlinkKafkaConsumer09, and so on, while the universal connector for Kafka >= 1.0.0 is simply called FlinkKafkaConsumer. Note also that starting with Flink 1.9.0, the universal connector uses the Kafka 2.2.0 client.
First, add the Maven dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
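If the project builds with sbt instead of Maven, the equivalent dependency line (with scalaVersion set to 2.11) is:
libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % "1.10.0"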
Compatibility:
Since Flink 1.7, the universal connector does not track a specific Kafka major version. If your Kafka broker is version 1.0.0 or newer, use this connector. If you run an older Kafka version (0.11, 0.10, 0.9, or 0.8), use the connector matching your broker version.
Complete code examples:
KafkaSource
package org.ourhome.streamapi

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaConsumerBase}

/**
 * @Author Do
 * @Date 2020/4/14 23:25
 */
object KafkaSource {
  private val KAFKA_TOPIC: String = "kafka_producer_test"

  def main(args: Array[String]): Unit = {
    val params: ParameterTool = ParameterTool.fromArgs(args)
    val runType: String = params.get("runtype")
    println("runType: " + runType)

    val properties: Properties = new Properties()
    properties.setProperty("bootstrap.servers", "host:port") // replace with your broker address
    properties.setProperty("group.id", "kafka_consumer")

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Exactly-once checkpointing is the basis for end-to-end consistency
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Enable checkpointing with a 5 s interval
    env.enableCheckpointing(5000) // checkpoint every 5000 msecs
    // Configure the state backend and where checkpoint data is stored
    env.setStateBackend(new FsStateBackend("file:///D:/Temp/checkpoint/flink/KafkaSource"))

    val dataSource: FlinkKafkaConsumerBase[String] = new FlinkKafkaConsumer[String](
      KAFKA_TOPIC,
      new SimpleStringSchema(),
      properties)
      .setStartFromLatest() // start consuming from the latest offset

    // Word count over 5-second tumbling windows; print only words seen more than 5 times
    env.addSource(dataSource)
      .flatMap(_.toLowerCase.split(" "))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .filter(_._2 > 5)
      .print()
      .setParallelism(1)

    // execute program
    env.execute("Flink Streaming - KafkaSource")
  }
}
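To exercise the job, send whitespace-separated text to the kafka_producer_test topic (Kafka's console producer is handy for this). The pipeline lower-cases each record, splits it on spaces, counts words per 5-second tumbling window (processing time, Flink's default time characteristic), and prints only the words that occur more than five times within a window.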
KafkaSink
package org.ourhome.streamapi

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaConsumerBase, FlinkKafkaProducer}

/**
 * @Author Do
 * @Date 2020/4/15 23:22
 */
object WriteIntoKafka {
  private val KAFKA_TOPIC: String = "kafka_producer_test"

  def main(args: Array[String]): Unit = {
    val params: ParameterTool = ParameterTool.fromArgs(args)
    val runType: String = params.get("runtype")
    println("runType: " + runType)

    val properties: Properties = new Properties()
    properties.setProperty("bootstrap.servers", "host:port") // replace with your broker address
    properties.setProperty("group.id", "kafka_consumer")

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Exactly-once checkpointing is the basis for end-to-end consistency
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // Enable checkpointing with a 5 s interval
    env.enableCheckpointing(5000) // checkpoint every 5000 msecs
    // Configure the state backend and where checkpoint data is stored
    env.setStateBackend(new FsStateBackend("file:///D:/Temp/checkpoint/flink/KafkaSource"))

    val dataSource: FlinkKafkaConsumerBase[String] = new FlinkKafkaConsumer[String](
      KAFKA_TOPIC,
      new SimpleStringSchema(),
      properties)
      .setStartFromLatest() // start consuming from the latest offset

    val dataStream: DataStream[String] = env.addSource(dataSource)

    // Simplest producer constructor: broker list, target topic, serialization schema.
    // This variant writes with the connector's default at-least-once semantic.
    val kafkaSink: FlinkKafkaProducer[String] = new FlinkKafkaProducer[String](
      "brokerList", // replace with your broker address
      "topic",      // replace with the target topic
      new SimpleStringSchema()
    )
    dataStream.addSink(kafkaSink)

    // execute program
    env.execute("Flink Streaming - KafkaSource and KafkaSink")
  }
}
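The three-argument FlinkKafkaProducer constructor used above writes with the connector's default at-least-once semantic. For end-to-end exactly-once writes, the universal connector also offers a constructor that takes producer Properties and a FlinkKafkaProducer.Semantic. The sketch below shows that variant; the broker address, topic name, and the StringRecordSchema helper are illustrative assumptions, not part of the original, and the brokers must support transactions (Kafka 0.11 or newer):
package org.ourhome.streamapi

import java.lang.{Long => JLong}
import java.nio.charset.StandardCharsets
import java.util.Properties

import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, KafkaSerializationSchema}
import org.apache.kafka.clients.producer.ProducerRecord

// Hypothetical helper: wraps each String into a ProducerRecord for the given topic
class StringRecordSchema(topic: String) extends KafkaSerializationSchema[String] {
  override def serialize(element: String, timestamp: JLong): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord[Array[Byte], Array[Byte]](topic, element.getBytes(StandardCharsets.UTF_8))
}

object ExactlyOnceSinkSketch {
  def buildSink(): FlinkKafkaProducer[String] = {
    val producerProps = new Properties()
    producerProps.setProperty("bootstrap.servers", "host:port") // placeholder
    // Flink's default transaction timeout (1 h) exceeds the broker default
    // transaction.max.timeout.ms (15 min), so set it explicitly
    producerProps.setProperty("transaction.timeout.ms", "600000")

    new FlinkKafkaProducer[String](
      "topic", // default target topic, placeholder
      new StringRecordSchema("topic"),
      producerProps,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE // transactional writes, committed when a checkpoint completes
    )
  }
}
Wiring it in is a one-line change in WriteIntoKafka: dataStream.addSink(ExactlyOnceSinkSketch.buildSink()).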