Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL
winwang | 135 points | 7 days ago
Hey HN! I'm Win, founder of ParaQuery (https://paraquery.com), a fully-managed, GPU-accelerated Spark + SQL solution. We deliver BigQuery's ease of use (or easier) while being significantly more cost-efficient and performant.
Here's a short demo of ParaQuery (vs. BigQuery) on a simple ETL job: https://www.youtube.com/watch?v=uu379YnccGU
It's well known, at least among researchers and GPU companies like NVIDIA, that GPUs are very good for many SQL and dataframe tasks. So much so that, in 2018, NVIDIA launched the RAPIDS program and the Spark-RAPIDS plugin (https://github.com/NVIDIA/spark-rapids). I actually found out about it because, at the time, I was trying to craft a CUDA-based lambda calculus interpreter…one of several ideas I didn't manage to implement, haha.
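For anyone curious what "GPU-accelerated Spark" looks like in practice: Spark-RAPIDS is enabled purely through Spark configuration. A minimal sketch (assumes the rapids-4-spark jar is already on the classpath and a GPU is available; this needs a live Spark cluster to actually run):

```python
from pyspark.sql import SparkSession

# Sketch: load the Spark-RAPIDS plugin so supported SQL operators
# run on the GPU. The plugin class name is from the spark-rapids docs.
spark = (
    SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .getOrCreate()
)

# Ordinary Spark SQL from here on -- supported operators are
# transparently placed on the GPU, the rest fall back to CPU.
spark.sql("SELECT 1 AS x").show()
```

The appeal is that existing Spark jobs don't need to be rewritten; acceleration is a deployment-time switch.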
There seems to be a perception among at least some engineers that GPUs are only good for AI, graphics, and maybe image processing (maybe! someone actually told me they thought GPUs are bad for image processing!). Traditional data processing doesn't come to mind. But GPUs are good for that as well!
At a high level, big data processing is a high-throughput, massively parallel workload. GPUs are hardware specialized for exactly that, are highly programmable, and (now) happen to be highly available on the cloud! Even better, GPU memory is tuned for bandwidth over raw latency, which further improves their throughput relative to CPUs. And by just playing with cloud cost calculators for a couple of minutes, it's clear that GPUs are cost-effective even on the major clouds.
To be honest, I thought using GPUs for SQL processing would have taken off by now, but it hasn't. So, just over a year ago, I started working on actually deploying a cloud-based data platform powered by GPUs (i.e. Spark-RAPIDS), spurred by a friend-of-a-friend(-of-a-friend) who happened to have BigQuery cost concerns at his startup. After getting a proof of concept done and a letter of intent... well, nothing happened! Even after over half a year. But then, something magical did happen: their cloud credits ran out!
And now, they're saving over 60% on their BigQuery bill by using ParaQuery, while also being 2x faster -- with zero data migration needed (courtesy of Spark's GCS connector). By the way, I'm not sure about other people's experiences but... we're pretty far from being IO-bound (to the surprise of many engineers I've spoken to).
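A note on the "zero migration" point: with the GCS connector on the classpath, Spark reads gs:// paths directly, so the data stays where it already lives. A hedged sketch (bucket, path, and column names here are made up; assumes a SparkSession with the GCS connector configured):

```python
# Read straight from GCS -- no copy into a warehouse first.
df = spark.read.parquet("gs://example-bucket/events/")

# A typical ETL aggregation, written exactly as it would be on CPU Spark.
daily = df.groupBy("event_date").count()

# Results land back in GCS.
daily.write.mode("overwrite").parquet("gs://example-bucket/daily_counts/")
```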
I think that the future of high-throughput compute is computing on high-throughput hardware. If you think so too, or you have scaling data challenges, you can sign up here: https://paraquery.com/waitlist. Sorry for the waitlist, but we're not ready for a self-serve experience just yet—it would front-load significant engineering and hardware cost. But we’ll get there, so stay tuned!
Thanks for reading! What have your experiences been with huge ETL / processing loads? Was cost or performance an issue? And what do you think about GPU acceleration (GPGPU)? Did you think GPUs were simply expensive? Would love to just talk about tech here!
andygrove | 7 days ago
Congrats on the launch!
I contributed to the NVIDIA Spark RAPIDS project for ~4 years and for the past year have been contributing to DataFusion Comet, so I have some experience in Spark acceleration and I have some questions!
1. Given the momentum behind the existing OSS Spark accelerators (Spark RAPIDS, Gluten + Velox, DataFusion Comet), have you considered collaborating with and/or extending these projects? All of them are multi-year efforts with dedicated teams. Both Spark RAPIDS and Gluten + Velox are leveraging GPUs already.
2. You mentioned that "We're fully compatible with Spark SQL (and Spark)." and that is very impressive if true. None of the existing accelerators claim this. Spark compatibility is notoriously difficult with Spark accelerators built with non-JVM languages and alternate hardware architectures. You have to deal with different floating-point implementations and regex engines, for example.
Also, Spark has some pretty quirky behavior. Do you match Spark when casting the string "T2" to a timestamp, for example? Spark compatibility has been pretty much the bulk of the work in my experience so far.
Providing acceleration at the same time as guaranteeing the same behavior as Spark is difficult and the existing accelerators provide many configuration options to allow users to choose between performance and compatibility. I'm curious to hear your take on this topic and where your focus is on performance vs compatibility.
winwang | 7 days ago
1. Yes! Would love to contribute back to these projects, since I am already using RAPIDS under the hood. My general goal is to bring GPU acceleration to more workloads. Though, as a solo founder, I'm finding it difficult to have any time for this at the moment, haha.
2. Hmm, maybe I should mention that we're not "accelerating all operations" -- merely compatible. Spark-RAPIDS has the goal of being byte-for-byte compatible unless incompatible ops are specifically allowed. But... you might be right about that kind of quirk. Would not be surprising, and reminds me of checking behavior between compilers.
I'd say the default should be a focus on compatibility, and work through any extra perf stuff with our customers. Maybe a good quick way to contribute back to open source is to first upstream some tests?
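To make the performance-vs-compatibility tradeoff concrete: Spark-RAPIDS exposes it through configuration flags. An illustrative fragment (flag names are from the spark-rapids configuration docs; treating them as examples, not an exhaustive or authoritative list):

```python
# Assumes an existing SparkSession `spark` with the RAPIDS plugin loaded.

# Master switch for GPU SQL execution.
spark.conf.set("spark.rapids.sql.enabled", "true")

# Opt in to operators whose results may deviate slightly from CPU Spark
# (e.g. floating-point ordering differences). Off by default for safety.
spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", "true")
```

Defaulting these to the conservative side, then loosening them per customer workload, matches the "compatibility first" stance above.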
Thanks for your great questions :)
torsstei | 15 hrs ago
Cool! How do you position against Databricks' support for Spark-RAPIDS (https://github.com/NVIDIA/spark-rapids-ml/blob/main/notebook...)?