Storm Kafka – Support of Wildcard Topics.

Storm is great for consuming “streams” of data, and it offers good support for consuming topic streams from Kafka.  However, as of this writing, the Storm Kafka module does not support wildcard topics.  For example, let's say we want to consume all topics matching a particular pattern, e.g.:

clickstream.abc.log

clickstream.def.log

clickstream.xyz.log

To consume these topics, you have to create multiple spouts.  And if new topics get added, there is no way for you to consume them dynamically.  This was a big drawback of the Storm Kafka module for me.  One option was to create my own Kafka Spout implementation, but that would mean losing all the existing features of the Storm Kafka Spout.
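To make the limitation concrete, here is a minimal sketch of the multi-spout workaround using the storm-kafka classes (the ZooKeeper address, zkRoot, and spout ids are placeholders of my own, not anything prescribed by Storm):

  import backtype.storm.spout.SchemeAsMultiScheme;
  import backtype.storm.topology.TopologyBuilder;
  import storm.kafka.BrokerHosts;
  import storm.kafka.KafkaSpout;
  import storm.kafka.SpoutConfig;
  import storm.kafka.StringScheme;
  import storm.kafka.ZkHosts;

  public class PerTopicSpouts {
    public static TopologyBuilder build() {
      BrokerHosts hosts = new ZkHosts("localhost:2181");
      TopologyBuilder builder = new TopologyBuilder();
      // One SpoutConfig and one KafkaSpout per topic -- a new matching topic
      // means editing this list and redeploying the topology.
      String[] topics = {"clickstream.abc.log", "clickstream.def.log", "clickstream.xyz.log"};
      for (String topic : topics) {
        SpoutConfig spoutConfig = new SpoutConfig(hosts, topic, "/kafka-spouts", topic);
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        builder.setSpout(topic + "-spout", new KafkaSpout(spoutConfig));
      }
      return builder;
    }
  }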

To overcome this, I decided to augment the Storm Kafka code with a wildcard feature.  I have opened a pull request for this change: https://github.com/apache/storm/pull/561.   Until it gets merged, you can consume the change directly from my repo at https://github.com/sumitchawla/storm.  Here are the steps:

1. Add the following repository.  JitPack allows you to consume this GitHub repo as a Maven repository.
 <repository>
   <id>jitpack.io</id>
   <url>https://jitpack.io</url>
 </repository>

2. Add the following dependency to your Maven project. Here the version tag is the GitHub commit hash.

 <dependency>
  <groupId>com.github.sumitchawla</groupId>
  <artifactId>storm</artifactId>
  <version>fe7de58f09</version>
  <exclusions>
    <exclusion>
       <groupId>com.github.sumitchawla.storm</groupId>
       <artifactId>storm-core</artifactId>
    </exclusion>
    <exclusion>
       <groupId>com.github.sumitchawla.storm</groupId>
       <artifactId>storm-hive</artifactId>
    </exclusion>
  </exclusions>
 </dependency>

3. Finally, add the following config to enable wildcard topic support:

       config.put("kafka.topic.wildcard.match", true);

After this, if you pass a wildcard topic to the Kafka Spout, e.g. “clickstream.*.log”, it will match all topics matching that pattern.  It discovers new topics dynamically, and you only need to create a single Kafka Spout.  You can use a parallelism hint so that the partitions of all matching topics get distributed evenly.
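Here is a minimal end-to-end sketch of the single-spout setup (topology, spout, and ZooKeeper names are placeholders, and I am submitting to a LocalCluster just for illustration):

  import backtype.storm.Config;
  import backtype.storm.LocalCluster;
  import backtype.storm.spout.SchemeAsMultiScheme;
  import backtype.storm.topology.TopologyBuilder;
  import storm.kafka.BrokerHosts;
  import storm.kafka.KafkaSpout;
  import storm.kafka.SpoutConfig;
  import storm.kafka.StringScheme;
  import storm.kafka.ZkHosts;

  public class ClickstreamTopology {
    public static void main(String[] args) throws Exception {
      BrokerHosts hosts = new ZkHosts("localhost:2181");

      // The wildcard pattern is passed where a plain topic name normally goes.
      SpoutConfig spoutConfig = new SpoutConfig(hosts, "clickstream.*.log", "/kafka-spouts", "clickstream-spout");
      spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

      TopologyBuilder builder = new TopologyBuilder();
      // Parallelism hint of 10: the partitions of all matching topics are
      // distributed across these spout tasks.
      builder.setSpout("clickstream-spout", new KafkaSpout(spoutConfig), 10);

      Config config = new Config();
      // Step 3 above: enable wildcard topic matching.
      config.put("kafka.topic.wildcard.match", true);

      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("clickstream-topology", config, builder.createTopology());
    }
  }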

Update:

As of Nov 2nd, 2015, this change has been merged into the main branch of apache/storm:

https://github.com/apache/storm/commit/60d9f81ba1b16f7711c487f831f202e49eda258c


3 thoughts on “Storm Kafka – Support of Wildcard Topics”

  1. Hi Sumit,

    Does this create one executor per partition, or does it map multiple partitions to a single executor? The issue in my case is that I have multiple topics and I need to consume them all in a single topology. If there are 15 topics with 10 partitions each, I will end up with a parallelism of 150. I know I don’t need that much; a parallelism of 10 would meet my requirements. Will this PR help deal with the issue mentioned?

    • It will evenly divide the partitions among the tasks. So in your case, with a parallelism of 10, each task will get 15 (topics) × 10 (partitions per topic) / 10 (tasks) = 15 partitions.
