Conversation
Just ran some to get an idea: parsing all the Spark TPC-DS queries, serializing all TPC-DS queries, and deserializing them. Keep in mind these timings are for all 95 queries, not individual ones.
@CircArgs Hmm, actually I tried this with one of our use cases -- specifically some metrics that depend on several layers of transforms, each of which has fairly nested subqueries. For those metrics, I serialized the compiled query AST and then deserialized it, but deserialization takes a long time (more than five minutes per metric). Maybe there's something else going on, but parsing and recompiling is faster. I started a change that would save this serialized AST on a node revision, but I'll hold off until we get to the bottom of the perf issues. I was having similar issues in #699, where deserialization turned out to be slower than just recompiling.
@shangyian I can hold off on this for now then. I wonder if there's a big distinction between serializing/deserializing non-compiled queries vs. compiled queries. My timings were all for non-compiled queries.
|
@CircArgs Maybe it's because some compiled queries, if they pull together many layers of transforms, can be huge. But I would still expect this to be faster than actually having to compile the queries. 🤔 On that thought, we also need to make the built queries more efficient/readable by removing all the columns that aren't used.
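One way to get to the bottom of the perf discussion above is to time deserialization in isolation, separate from parsing and compilation. A minimal sketch of that measurement pattern, using a deeply nested dict as a stand-in for a compiled AST (the real compiled query trees are not shown here):

```python
import json
import timeit

def nested(depth: int) -> dict:
    """Build a deeply nested dict as a stand-in for a compiled AST.

    Hypothetical shape: each level wraps the previous one in a
    "select" node, mimicking layers of nested subqueries.
    """
    node = {"kind": "column", "name": "x"}
    for _ in range(depth):
        node = {"kind": "select", "projection": [node], "from": {"kind": "table"}}
    return node

# Serialize once, then time only the deserialization step.
payload = json.dumps(nested(200))
secs = timeit.timeit(lambda: json.loads(payload), number=100)
print(f"100 deserializations: {secs:.3f}s")
```

Running the same harness against both a non-compiled and a compiled serialized query would show whether the slowdown scales with nesting depth or with total payload size.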
Summary
Serializes and deserializes ASTs, maintaining all information even after compilation, in a flat form that is JSON-serializable.
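The idea of flattening an AST into a JSON-serializable form can be sketched as follows. This is a minimal illustration, not the project's actual node classes: each node becomes one dict in a flat list, with child links stored as integer indices so the nested structure survives a round trip.

```python
import json
from dataclasses import dataclass, field

# Hypothetical minimal AST node; the real AST classes are richer.
@dataclass
class Node:
    kind: str                       # e.g. "select", "column", "table"
    value: str = ""
    children: list["Node"] = field(default_factory=list)

def flatten(root: Node) -> str:
    """Serialize an AST into a flat, JSON-serializable list of records."""
    records: list[dict] = []

    def visit(node: Node) -> int:
        idx = len(records)
        records.append({"kind": node.kind, "value": node.value, "children": []})
        # Children are appended after this record, so their indices are larger.
        records[idx]["children"] = [visit(child) for child in node.children]
        return idx

    visit(root)
    return json.dumps(records)

def unflatten(payload: str) -> Node:
    """Rebuild the AST from the flat JSON list (root is record 0)."""
    records = json.loads(payload)

    def build(idx: int) -> Node:
        rec = records[idx]
        return Node(rec["kind"], rec["value"], [build(c) for c in rec["children"]])

    return build(0)
```

A round trip preserves the tree, e.g. `unflatten(flatten(ast)) == ast` for a dataclass-based AST, since dataclasses compare by field values.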
Test Plan
unit tests
`make check` passes; `make test` shows 100% unit test coverage