[PERF]: Measure the performance impact of the "layered design" in cuda-bindings #1605

@mdboom

API calls in cuda-bindings currently pass through three layers.

As an experiment to measure the performance impact of calling through these layers, I "flattened" the call so that the top layer directly calls the C function pointer in the library (currently handled by the bottom layer). The overhead of each layer is small by design, but the call path still includes some Python exception handling, as well as our library initialization check (cuPythonInit()).

While we lose some safety and version independence doing this, it is useful as an experiment to see what the cost of that flexibility is.

My changes

Measuring this with the benchmark in #659, I do not see any measurable change. Branch predictors must be pretty good these days.

Before: Mean +- std dev: 2.77 us +- 0.37 us
After: Mean +- std dev: 2.76 us +- 0.21 us
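As a rough sanity check on "no measurable change", the difference between the two means can be compared with the spread of the measurements. The sketch below uses only the numbers reported above; pooling the two standard deviations with equal weight is an assumption, since the number of samples per run is not stated here.

```python
import math

# Values reported above, in microseconds.
before_mean, before_std = 2.77, 0.37
after_mean, after_std = 2.76, 0.21

# Difference between the means.
diff = before_mean - after_mean

# Pooled standard deviation, assuming both runs had the same sample count.
pooled = math.sqrt((before_std**2 + after_std**2) / 2)

# Difference expressed as a fraction of the pooled standard deviation.
effect = diff / pooled

print(f"difference: {diff:.2f} us")
print(f"pooled std dev: {pooled:.2f} us")
print(f"difference / pooled std dev: {effect:.3f}")
```

Under that assumption the difference is only a few percent of the measurement spread, which is consistent with treating the two results as indistinguishable.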

Labels: experiment (Describes an investigation or measurement)
