Post messages quickly in bulk from a dataframe to Kafka

We have been trying to publish a group of messages from a dataframe. It was suggested we try polars instead of pandas, but that made only a small improvement. We either need to cycle through every record one at a time (which takes 16 seconds for 55k records), or, if we push the entire dataframe at once, the data never gets to Kafka and we see no errors.

Hi @Michael_Zimberg!

Since every environment is different, there’s no one-size-fits-all recipe for high performance, so you’ll need to do some investigation and tuning to find your bottlenecks. You could be saturating any number of resources:

  • The serialization throughput in your application
  • The buffering and batching logic in the Producer
  • The network link between your application and Kafka
  • Throttling & quotas configured on the broker
  • Disk I/O on the Kafka broker side

The Kafka Producer has a number of tuning configurations that may be easy wins for performance (a config sketch follows this list):

  • compression.type
  • batch.size
  • linger.ms
  • buffer.memory
  • max.in.flight.requests.per.connection
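
As a rough illustration, here is how those knobs might look with the confluent-kafka Python client. The broker address and values are placeholders, and librdkafka spells a few names differently from the Java client (for example, it has no buffer.memory; queue.buffering.max.kbytes is the closest analogue), so treat this as a starting point to tune against your own metrics:

```python
from confluent_kafka import Producer

# Hypothetical starting values -- tune against your own metrics, not these.
producer = Producer({
    "bootstrap.servers": "localhost:9092",      # placeholder broker address
    "compression.type": "lz4",                  # trade a little CPU for less network/disk
    "batch.size": 1_048_576,                    # allow batches up to 1 MiB
    "linger.ms": 50,                            # wait up to 50 ms to fill a batch
    "queue.buffering.max.kbytes": 1_048_576,    # librdkafka's analogue of buffer.memory
    "max.in.flight.requests.per.connection": 5, # pipelining per broker connection
})
```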

A fully optimized Kafka pipeline can saturate the network and/or disks of the Kafka brokers, so you should start by profiling and examining metrics for your client-side application to see where the slowdown is.
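
A quick way to start is simply timing the stages separately, so you know whether serialization or producing dominates those 16 seconds. A minimal sketch, reusing the producer above, with a hypothetical serialize_row function and polars dataframe df standing in for your own (error handling omitted for brevity):

```python
import time

# serialize_row and df are stand-ins for your own serializer and dataframe.
t0 = time.perf_counter()
payloads = [serialize_row(row) for row in df.iter_rows(named=True)]  # polars row iteration
t1 = time.perf_counter()

for payload in payloads:
    producer.produce("my-topic", value=payload)  # placeholder topic name
    producer.poll(0)                             # serve delivery callbacks as we go
producer.flush()                                 # wait for everything to be delivered
t2 = time.perf_counter()

print(f"serialize: {t1 - t0:.2f}s  produce+flush: {t2 - t1:.2f}s")
```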

As for the other issue you mentioned:

> if we push the entire dataframe, the data never gets to Kafka, and we do not see errors.

If your data is over 1MB after serialization, you may be exceeding Kafka’s default message size limit (the broker’s message.max.bytes defaults to roughly 1MB). Kafka is oriented towards many smaller messages rather than single large, file-sized ones. Handling file-sized records often requires a non-Kafka component, such as uploading the record to object storage and forwarding a handle to it via Kafka.
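
For completeness, a sketch of that claim-check style approach, assuming S3 via boto3; the bucket name, topic name, and key scheme are all made up for illustration:

```python
import json
import uuid

import boto3
from confluent_kafka import Producer

s3 = boto3.client("s3")
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def publish_large_payload(payload: bytes) -> None:
    # Upload the oversized blob to object storage...
    key = f"kafka-payloads/{uuid.uuid4()}"
    s3.put_object(Bucket="my-bucket", Key=key, Body=payload)  # hypothetical bucket
    # ...and send only a small reference through Kafka.
    handle = json.dumps({"bucket": "my-bucket", "key": key}).encode("utf-8")
    producer.produce("large-payload-refs", value=handle)      # hypothetical topic
    producer.flush()
```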

A much more typical design pattern is to serialize the individual rows of data as separate records, then reassemble them after consuming.
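
In code, that pattern could look like the following. This is a sketch assuming a polars dataframe, JSON-friendly column types, and the confluent-kafka client; the BufferError handling is what keeps a 55k-row loop from stalling silently when the client’s local queue fills up:

```python
import json

def produce_rows(producer, df, topic: str) -> None:
    """Publish each dataframe row as its own Kafka record."""
    for row in df.iter_rows(named=True):  # polars; adapt for pandas as needed
        payload = json.dumps(row).encode("utf-8")
        while True:
            try:
                producer.produce(topic, value=payload)
                break
            except BufferError:
                # Local queue is full: serve delivery reports, then retry.
                producer.poll(0.1)
        producer.poll(0)   # serve callbacks without blocking
    producer.flush()       # block until all records are delivered
```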

Wow, amazing answer, Greg! I’ve tagged this question as a “faq” since that answer is likely to help other folks with similar issues.

And thank you too @Michael_Zimberg for asking it. :smiley: