First, try to optimize your Python. It's surprising what you can do with it. E.g., slicing assignment is crazy fast, removing calls and lookups goes a long way, and using built-ins, generators, `@lru_cache`, and slicing wins a lot. Also, Python 3.6 is faster, so upgrading is nice.
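A hedged sketch of a couple of those tricks (the function names are made up for illustration): caching repeated work with `@lru_cache`, hoisting a lookup out of a hot loop, and using slice assignment instead of an element-by-element loop.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Memoization turns an exponential recursion into a linear one.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def total_length(words):
    # Binding the built-in to a local name skips a global lookup per iteration.
    length = len
    return sum(length(w) for w in words)

def zero_fill(buf):
    # Slice assignment replaces every element in C, without a Python-level loop.
    buf[:] = [0] * len(buf)
```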
Then, try PyPy. It may very well run your code 10 times faster with no work on your part.
If you can't, or the result is not as good as expected, you can start using EXISTING compiled extensions: numpy, uvloop, ujson, etc.
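As an illustration of that step, assuming the hot spot is numeric, this is the kind of change it implies: push the loop into NumPy's compiled code instead of iterating in the interpreter.

```python
import numpy as np

values = list(range(100_000))

# Pure-Python version: every multiply and add goes through the interpreter.
total_py = sum(v * v for v in values)

# Same computation handed to NumPy's C loops.
arr = np.asarray(values, dtype=np.float64)
total_np = float(np.dot(arr, arr))

print(total_py, total_np)
```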
After that, and only then, should you think about a rewrite: Numba for number-related code, Cython for classic hot paths, or Nuitka for the entire app. They can turn Python code into compiled code and will save you a rewrite in a low-level tech.
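A minimal sketch of the Numba route, assuming a numeric hot loop (`pairwise_sum` is a made-up example): the `@njit` decorator compiles the function to machine code on first call, with no rewrite in another language.

```python
import numpy as np
from numba import njit

@njit
def pairwise_sum(xs):
    # Numba compiles this loop to machine code; the same loop in plain Python
    # would go through the interpreter on every iteration.
    total = 0.0
    for x in xs:
        total += x * x
    return total

print(pairwise_sum(np.arange(1_000_000, dtype=np.float64)))
```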
If all of that failed, congratulations, you are part of the 0.00001%.
So rewriting bottlenecks in a compiled language is a good option. C or C++ will do. But remember you can do it in Rust too!
I love Python, but this is one of the pain points: working through a sequence of domain-specific languages when you could have just written it in a fast one to begin with (e.g., Julia or C++).
The reason so few projects are rewritten in C/C++ is that many people know up front that their project will require that performance and just start there.
If you are building a high-end 3D video game with anything like current fancy graphics, no amount of Python or Ruby is going to make it work. You must start with C or C++ to make effective use of modern hardware (even using the C# that Unity provides leaves a lot of performance on the table).
If you are building a system designed to be faster than some other well-defined system, then starting with C or C++ is a good idea. If your Java or C# system could handle 1 million transactions a second, you might be able to complete 1.5 million/s with C++.
Some projects never need that level of performance, and building those projects in C++ can cost you some time. Most websites are in that vein: how many hits a day does a typical website get? Only a few of the biggest retailers and search engines need that level of performance.
That time cost is also shrinking, but not shrinking as fast as I would like. C++11, 14, and 17 each took chunks off development time by polishing some of the sharp corners of the language. Memory leaks are harder to make. Threads and time are easier to work with. Error messages are better than ever.
There is still progress to be made. Every C++ project still needs some time dedicated to configuring the build system. There needs to be some plan for checking for memory issues; there needs to be... I think C++ will continue to get more Rust-like, and Rust will continue to grow in popularity and performance. Eventually, I think Rust or something like it will be the preferred high-performance language.
The secret is to write as much as you can in the "high level" language (Python) and then specialize just the critical path in C/C++. That gives you the best balance between clarity, developer time, and performance.
I'd remove PyPy. It's not 100% compatible, and the two have different antipatterns. For some time I've treated CPython and PyPy as two very similar, but different, languages. PyPy is more C-ish, if you want to call it that: what's fast in it is more direct, closer to the C mindset. In CPython, a good abstraction will usually give you better performance. If you mix them, it's not quite one thing or the other.
This doesn't directly answer your question, but I think a better use case for pybind11 is when you have an existing C++ library with a fairly rich type system and you want to expose it to Python.
If you just want to reimplement some parts of your code in C for performance (I'd argue you neither need nor want C++: you shouldn't bother with C++'s object system and should keep using Python's), CFFI might be simpler:
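A minimal sketch of what the CFFI route looks like in ABI mode; the library name `libdemo.so` and the function `fast_sum` are hypothetical, standing in for whatever C you compile yourself.

```python
from cffi import FFI

ffi = FFI()
# Declare the C signature exactly as it appears in the header.
ffi.cdef("double fast_sum(const double *values, size_t n);")
lib = ffi.dlopen("./libdemo.so")   # shared library built from plain C

data = [1.0, 2.5, 3.25]
buf = ffi.new("double[]", data)    # C array whose lifetime Python manages
print(lib.fast_sum(buf, len(data)))
```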
A typical use case is pushing loops in the hot path of your code into C++ space. A simple loop that does nothing, like `for i in range(int(1e9)): pass`, takes about 20 seconds to execute in Python on my machine, whereas in C++ the overhead would be thousands of times smaller.
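If you want to see that interpreter overhead on your own machine, here is a rough timing of the same empty loop, scaled down to 10 million iterations (numbers will vary by machine and Python version):

```python
import time

start = time.perf_counter()
for i in range(10_000_000):   # scaled down from the 1e9 example above
    pass
print(f"{time.perf_counter() - start:.2f}s of pure loop overhead")
```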
Because there's also overhead for transferring objects to/from pybind11 (it has to keep track of object lifetimes, figure out the conversions, etc.), it's generally more beneficial to wrap big chunks of logic in C++ rather than every single method.
I wrote a demuxer for CAN bus data in C++. Then I built a Python module using Boost.Python so that our Python programmers could simply `import demux` and write scripts to see and manipulate individual variables, load data into the DB, etc.
The productivity gains of Python scripts that could demux were huge. It made demuxing fast and easily accessible to all of the Python coders. Also, the underlying C++ code could be used in our C# environment too, although I'm not sure if they ever did that. So, we had the ability to use the exact same underlying C++ code in multiple dev environments to ensure consistency.
The downside, IMO, was maintaining the modules. The C++ code itself was short and easy to test (maybe 100 or 200 LOC), but you need good documentation and the ability to build the modules for various versions of Python, with various toolchains, on various systems, etc.
I would do it again if in the same or similar situation.
The easiest option would be to give Cython a try first, as it can provide a nice performance boost with minimal effort in some cases.
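A minimal sketch of what that first Cython try can look like, using Cython's "pure Python" mode so the file stays importable as ordinary Python; `harmonic` is a made-up example, and compiling the file with `cythonize` is what turns the typed loop into C.

```python
import cython

def harmonic(n: cython.int) -> cython.double:
    # With the cython type annotations, the compiled version runs this loop
    # on C ints and doubles; under plain CPython it still works, just slower.
    total: cython.double = 0.0
    i: cython.int
    for i in range(1, n + 1):
        total += 1.0 / i
    return total
```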
If you just need to bind simple number-crunching functions, you may also write a small library exposing them with C linkage, then access them using ctypes.
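For example, assuming a C file exposing a `mean()` function with C linkage has been compiled to `libfast.so` (both names made up for illustration), the ctypes side is only a few lines:

```python
import ctypes

lib = ctypes.CDLL("./libfast.so")
lib.mean.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.mean.restype = ctypes.c_double

data = [1.0, 2.0, 3.0]
arr = (ctypes.c_double * len(data))(*data)   # contiguous C array of doubles
print(lib.mean(arr, len(data)))
```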
For more advanced usage, you will probably benefit from using a binding framework instead of the raw Python C API.
Does anyone have experience taking one expensive method and replacing it with C/C++? Were the trade-offs worth it?