您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Adding performance wings to Python algorithm -- pybind11 landing practice

編輯：Python

1. background

at present AI Algorithm development, especially training, is basically based on Python Mainly , Mainstream AI Calculation framework, such as TensorFlow、PyTorch And so on Python Interface . There is a good saying , Life is too short , I use Python. But because of Python Dynamic language , Explain execution and lack of maturity JIT programme , In computing intensive scenarios, multi-core concurrency is limited , It is difficult to directly meet the real-time requirements of high performance Serving demand . In some scenarios with high performance requirements , Still need to use C/C++ To solve . But if the algorithm students are required to use it all C++ To develop online reasoning services , And the cost is very high , Lead to development efficiency and waste of resources . therefore , If there is a light way to Python And part C++ Combined with the core code written , It can achieve the effect of ensuring both development efficiency and service performance . This paper mainly introduces pybind11 In Tencent advertising multimedia AI Python Accelerated practice of algorithm , And some experience summary in the process .

2. Industry solutions

2.1 Native scheme

Python The official provided Python/C API, Can achieve 「 use C Language writing Python library 」, Let's start with the last piece of code and feel ：

static PyObject *
spam_system(PyObject *self, PyObject *args)
{
const char *command;
int sts;
if (!PyArg_ParseTuple(args, "s", &command))
return NULL;
sts = system(command);
return PyLong_FromLong(sts);
}

It can be seen that the transformation cost is very high , All basic types must be manually changed to CPython Interpreter encapsulated binding type . It's not hard to understand , why Python The official website also recommends that you use third-party solutions 1.

2.2 Cython

Cython The main connection is Python and C, Convenient for Python To write C Expand .Cython The compiler supports conversion Python The code is C Code , these C The code can call Python/C Of API. essentially ,Cython That is to say, it contains C Data type Python. at present Python Of numpy, And our factory tRPC-Python The framework is applied .

shortcoming ：

Manual implantation is required Cython Bring your own grammar （cdef etc. ）, The cost of transplantation and reuse is high
Other documents need to be added , Such as setup.py、*.pyx Let's get you started Python Finally, the code can be converted to high-performance C Code
about C++ The level of support is in doubt

2.3 SIWG

SIWG It mainly solves other high-level languages and C and C++ The problem of language interaction , Supports more than a dozen programming languages , Including common java、C#、javascript、Python etc. . You need to use *.i File definition interface , Then use tools to generate cross language interaction code . However, due to the large number of languages supported , So in Python The end performance is not very good .

It is worth mentioning that ,TensorFlow Early also used SWIG To encapsulate Python Interface , Formally due to SIWG The performance is not good enough 、 Build complex 、 The binding code is obscure and difficult to read ,TensorFlow Has been in 2019 The year will be SIWG Switch to a pybind112.

2.4 Boost.Python

C++ Widely used in Boost Open source library , Also provided Python binding function . On use , Simplify through macro definition and metaprogramming Python Of API call . But the biggest disadvantage is the need to rely on huge Boost library , The burden of compilation and dependencies is heavy , Only for solving Python binding If so, there is a visual sense of anti-aircraft shelling mosquitoes .

2.5 pybind11

It can be understood as taking Boost.Python As a blueprint , Available only Python & C++ binding A compact version of the function , be relative to Boost.Python stay binary size And compilation speed has many advantages . Yes C++ Support is very good , be based on C++11 Various new features have been applied , Maybe pybind11 The suffix 11 That's why .

Pybind11 adopt C++ Compile time introspection to infer type information , To minimize traditional expansion Python Module is a complex template code , And implements common data types , Such as STL data structure 、 Intelligent pointer 、 class 、 function overloading 、 Example method Python Automatic conversion of , Functions can receive and return values of custom data types 、 Pointer or reference .

characteristic ：

Light weight and single function , Focus on providing C++ & Python binding, The interaction code is concise
For the common C++ Data types such as STL、Python Library like numpy Good compatibility , No labor conversion costs
only header The way , No additional code generation required , The binding relationship can be established at compile time , Reduce binary size
Support C++ New characteristics , Yes C++ overloaded 、 Inherit ,debug The way is convenient and easy to use
Perfect official document support , Applied to many well-known open source projects

“Talk is cheap, show me your code.” Three lines of code can quickly implement binding , You deserve it ：

PYBIND11_MODULE (libcppex, m) {
m.def("add", [](int a, int b) -> int { return a + b; });
}

3. Python transfer C++

3.1 from GIL The lock talks about

GIL（Global Interpreter Lock） Global interpreter lock ： Only one thread is allowed to use the interpreter in a process at the same time , As a result, multithreading cannot really use multi-core . Because the thread holding the lock is executing to I/O Some wait operations such as dense functions will be released automatically GIL lock , So for I/O For intensive services , Multithreading is effective . But for the CPU Intensive operation , Because only one thread can actually perform calculation at a time , The impact on performance can be imagined .

It must be said that ,GIL Not at all Python Its own flaws , But now Python Default CPython Thread safety protection lock introduced by parser . We generally say Python There is GIL lock , In fact, only for CPython Interpreter . So if we can find a way to avoid GIL lock , Will it have a good acceleration effect ？ The answer is yes , One solution is to use other interpreters instead, such as pypy etc. , But for mature C The compatibility of extension libraries is not good enough , Maintenance costs are high . Another option , It is through C/C++ Extensions to encapsulate compute intensive code , And remove... On execution GIL lock .

3.2 Python Algorithm performance optimization

pybind11 It provides in C++ End manual release GIL The interface of the lock , therefore , We just need to put the part of the code that is computationally intensive , Transform into C++ Code , And release... Before and after execution / obtain GIL lock ,Python The multi-core computing power of the algorithm is unlocked . Of course , In addition to displaying the release of the calling interface GIL In addition to the method of locking , It can also be in C++ Internally switch computing intensive code to other C++ Threads execute asynchronously , It can also avoid GIL The lock utilizes multi-core .

Let's say 100 Take the calculation of spherical distance between 10000 cities as an example , contrast C++ Performance difference before and after expansion ：

C++ End ：

#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pybind11/embed.h>
namespace py = pybind11;
const double pi = 3.1415926535897932384626433832795;
double rad(double d) {
return d * pi / 180.0;
}
double geo_distance(double lon1, double lat1, double lon2, double lat2, int test_cnt) {
py::gil_scoped_release release; // Release GIL lock
double a, b, s;
double distance = 0;
for (int i = 0; i < test_cnt; i++) {
double radLat1 = rad(lat1);
double radLat2 = rad(lat2);
a = radLat1 - radLat2;
b = rad(lon1) - rad(lon2);
s = pow(sin(a/2),2) + cos(radLat1) * cos(radLat2) * pow(sin(b/2),2);
distance = 2 * asin(sqrt(s)) * 6378 * 1000;
}
py::gil_scoped_acquire acquire; // C++ Resume before execution ends GIL lock
return distance;
}
PYBIND11_MODULE (libcppex, m) {
m.def("geo_distance", &geo_distance, R"pbdoc(
Compute geography distance between two places.
)pbdoc");
}

Python Calling end ：

import sys
import time
import math
import threading
from libcppex import *
def rad(d):
return d * 3.1415926535897932384626433832795 / 180.0
def geo_distance_py(lon1, lat1, lon2, lat2, test_cnt):
distance = 0
for i in range(test_cnt):
radLat1 = rad(lat1)
radLat2 = rad(lat2)
a = radLat1 - radLat2
b = rad(lon1) - rad(lon2)
s = math.sin(a/2)**2 + math.cos(radLat1) * math.cos(radLat2) * math.sin(b/2)**2
distance = 2 * math.asin(math.sqrt(s)) * 6378 * 1000
print(distance)
return distance
def call_cpp_extension(lon1, lat1, lon2, lat2, test_cnt):
res = geo_distance(lon1, lat1, lon2, lat2, test_cnt)
print(res)
return res
if __name__ == "__main__":
threads = []
test_cnt = 1000000
test_type = sys.argv[1]
thread_cnt = int(sys.argv[2])
start_time = time.time()
for i in range(thread_cnt):
if test_type == 'p':
t = threading.Thread(target=geo_distance_py,
args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
elif test_type == 'c':
t = threading.Thread(target=call_cpp_extension,
args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,))
threads.append(t)
t.start()
for thread in threads:
thread.join()
print('calc time = %d' % int((time.time() - start_time) * 1000))

Performance comparison ：

Single thread time consumption ：Python 1500ms,C++ 8ms
10 Thread time consumption ：Python 15152ms,C++ 16ms
CPU utilization ：
△ Python Multithreading can't perform multi-core parallel computing at the same time , Only equivalent to single core utilization
△ C++ You can eat all DevCloud Mechanical 10 individual CPU nucleus

Conclusion ：

Compute intensive code , Simply change to C++ Achieve good performance improvement , Release in multithreading GIL Under the blessing of the lock , Make full use of multicore , Performance easily get linear speedup , Greatly improve the utilization of resources . Although it can also be used in actual scenes Python Multi process approach to take advantage of multi-core , But when the model is getting bigger and bigger, it is often dozens of G Under the trend of , Too much memory is used, don't say , Frequent switching between processes context switching overhead, And the performance differences of the language itself , Cause and C++ There are still many gaps in the expansion mode .

（ notes ： Above tests demo github Address ：https://github.com/jesonxiang/cpp_extension_pybind11, The test environment is CPU 10 Nuclear vessel , If you are interested, you can also do performance verification .）

3.3 Compile environment

Compile instructions ：

g++ -Wall -shared -std=gnu++11 -O2 -fvisibility=hidden -fPIC -I./ perfermance.cc -o libcppex.so `Python3-config --cflags --ldflags --libs`

If Python The environment is not configured correctly, and an error may be reported ：

Here to Python Our dependence is through Python3-config --cflags --ldflags --libs To automatically specify , You can run this command alone to verify Python Whether the dependency is configured correctly .Python3-config Normal execution depends on Python3-dev, You can install it with the following command ：

yum install Python3-devel

4. C++ transfer Python

commonly pybind11 Are used to give C++ Code encapsulation Python End interface , But the other way around C++ transfer Python Also supportive . just #include <pybind11/embed.h> The header file is ready to use , The interior is embedded CPython Interpreter to implement . It is also very simple and easy to use , At the same time, it has good readability , With direct call Python The interface is very similar . For example, to a numpy Array calls some methods , Reference examples are as follows ：

// C++
pyVec = pyVec.attr("transpose")().attr("reshape")(pyVec.size());

# Python
pyVec = pyVec.transpose().reshape(pyVec.size)

The following is based on our C++ GPU High performance frame extraction so For example , In addition to providing frame extraction interface to Python End calls , You need to call back to Python So as to inform the frame drawing progress and frame data .

Python End callback interface ：

def on_decoding_callback(task_id:str, progress:int):
print("decoding callback, task id: %s, progress: %d" % (task_id, progress))
if __name__ == "__main__":
decoder = DecoderWrapper()
decoder.register_py_callback(os.getcwd() + "/decode_test.py",
"on_decoding_callback")

C++ End interface registration & Callback Python：

#include <pybind11/embed.h>
int DecoderWrapper::register_py_callback(const std::string &py_path,
const std::string &func_name) {
int ret = 0;
const std::string &pyPath = py_get_module_path(py_path);
const std::string &pyName = py_get_module_name(py_path);
SoInfo("get py module name: %s, path: %s", pyName.c_str(), pyPath.c_str());
py::gil_scoped_acquire acquire;
py::object sys = py::module::import("sys");
sys.attr("path").attr("append")(py::str(pyPath.c_str())); //Python The path of the script
py::module pyModule = py::module::import(pyName.c_str());
if (pyModule == NULL) {
LogError("Failed to load pyModule ..");
py::gil_scoped_release release;
return PYTHON_FILE_NOT_FOUND_ERROR;
}
if (py::hasattr(pyModule, func_name.c_str())) {
py_callback = pyModule.attr(func_name.c_str());
} else {
ret = PYTHON_FUNC_NOT_FOUND_ERROR;
}
py::gil_scoped_release release;
return ret;
}
int DecoderListener::on_decoding_progress(std::string &task_id, int progress) {
if (py_callback != NULL) {
try {
py::gil_scoped_acquire acquire;
py_callback(task_id, progress);
py::gil_scoped_release release;
} catch (py::error_already_set const &PythonErr) {
LogError("catched Python exception: %s", PythonErr.what());
} catch (const std::exception &e) {
LogError("catched exception: %s", e.what());
} catch (...) {
LogError("catched unknown exception");
}
}
}

5. Data type conversion

5.1 Class member function

For classes and member functions binding, First, you need to construct objects , So there are two steps ： The first step is the construction method of packaging instance , Another step is to register the access method of member functions . meanwhile , Also support passing def_static、def_readwrite To bind static methods or member variables , For details, please refer to the official documents 3.

#include <pybind11/pybind11.h>
class Hello
{
public:
Hello(){}
void say( const std::string s ){
std::cout << s << std::endl;
}
};
PYBIND11_MODULE(py2cpp, m) {
m.doc() = "pybind11 example";
pybind11::class_<Hello>(m, "Hello")
.def(pybind11::init()) // Constructors , Corresponding c++ Class constructor , If there is no declaration or the parameters are wrong , Will cause the call to fail
.def( "say", &Hello::say );
}
/*
Python Call mode ：
c = py2cpp.Hello()
c.say()
*/

5.2 STL Containers

pybind11 Support STL Automatic container conversion , When it needs to be handled STL When the container , Just include additional header files <pybind11/stl.h> that will do .pybind11 The automatic conversion provided includes ：std::vector<>/std::list<>/std::array<> convert to Python list ;std::set<>/std::unordered_set<> convert to Python set ; std::map<>/std::unordered_map<> convert to dict etc. . Besides std::pair<> and std::tuple<> The transformation of is also <pybind11/pybind11.h> The header file provides .

#include <iostream>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
class ContainerTest {
public:
ContainerTest() {}
void Set(std::vector<int> v) {
mv = v;
}
private:
std::vector<int> mv;
};
PYBIND11_MODULE( py2cpp, m ) {
m.doc() = "pybind11 example";
pybind11::class_<ContainerTest>(m, "CTest")
.def( pybind11::init() )
.def( "set", &ContainerTest::Set );
}
/*
Python Call mode ：
c = py2cpp.CTest()
c.set([1,2,3])
*/

5.3 bytes、string Type passing

Because in Python3 in string The default type is UTF-8 code , If from C++ End transport string Type of protobuf Data to Python, It will appear “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte” The error of .

Solution ：pybind11 Provides non text data binding type py::bytes：

m.def("return_bytes",
[]() {
std::string s("\xba\xd0\xba\xd0"); // Not valid UTF-8
return py::bytes(s); // Return the data without transcoding
}
);

5.4 Intelligent pointer

std::unique_ptr pybind11 Support direct conversion ：

std::unique_ptr<Example> create_example() { return std::unique_ptr<Example>(new Example()); }
m.def("create_example", &create_example);

std::shared_ptr Here's the thing to watch out for , You cannot use bare pointers directly . As follows get_child Function in Python The end call will report memory access exception （ Such as segmentation fault）.

class Child { };
class Parent {
public:
Parent() : child(std::make_shared<Child>()) { }
Child *get_child() { return child.get(); } /* Hint: ** DON'T DO THIS ** */
private:
std::shared_ptr<Child> child;
};
PYBIND11_MODULE(example, m) {
py::class_<Child, std::shared_ptr<Child>>(m, "Child");
py::class_<Parent, std::shared_ptr<Parent>>(m, "Parent")
.def(py::init<>())
.def("get_child", &Parent::get_child);
}

5.5 cv::Mat To numpy transformation

The frame extraction result is returned to Python End time , Due to the present pybind11 Automatic conversion is not supported at present cv::Mat data structure , Therefore, it needs to be handled manually C++ cv::Mat and Python End numpy Binding between . The conversion code is as follows ：

/*
Python->C++ Mat
*/
cv::Mat numpy_uint8_3c_to_cv_mat(py::array_t<uint8_t>& input) {
if (input.ndim() != 3)
throw std::runtime_error("3-channel image must be 3 dims ");
py::buffer_info buf = input.request();
cv::Mat mat(buf.shape[0], buf.shape[1], CV_8UC3, (uint8_t*)buf.ptr);
return mat;
}
/*
C++ Mat ->numpy
*/
py::array_t<uint8_t> cv_mat_uint8_3c_to_numpy(cv::Mat& input) {
py::array_t<uint8_t> dst = py::array_t<uint8_t>({ input.rows,input.cols,3}, input.data);
return dst;
}

5.6 zero copy

Generally speaking, cross language calls produce performance overhead, Especially for the transmission of large data blocks . therefore ,pybind11 It also supports the way of data address transmission , Avoid the copy operation of large data blocks in memory , The performance is greatly improved .

class Matrix {
public:
Matrix(size_t rows, size_t cols) : m_rows(rows), m_cols(cols) {
m_data = new float[rows*cols];
}
float *data() { return m_data; }
size_t rows() const { return m_rows; }
size_t cols() const { return m_cols; }
private:
size_t m_rows, m_cols;
float *m_data;
};
py::class_<Matrix>(m, "Matrix", py::buffer_protocol())
.def_buffer([](Matrix &m) -> py::buffer_info {
return py::buffer_info(
m.data(), /* Pointer to buffer */
sizeof(float), /* Size of one scalar */
py::format_descriptor<float>::format(), /* Python struct-style format descriptor */
2, /* Number of dimensions */
{ m.rows(), m.cols() }, /* Buffer dimensions */
{ sizeof(float) * m.cols(), /* Strides (in bytes) for each index */
sizeof(float) }
);
});

6. to ground & Industry Applications

The above plan , We have been advertising multimedia AI Color extraction related services 、GPU High performance frame extraction and other algorithms , Achieved a very good acceleration effect . In the industry , Most of them are on the market at present AI Computing framework , Such as TensorFlow、Pytorch、 Ali X-Deep Learning、 Baidu PaddlePaddle etc. , Both use pybind11 To provide C++ To Python End interface encapsulation , Its stability and performance have been widely verified .

7. Conclusion

stay AI In this field, it is common to increase revenue and reduce expenditure 、 Under the background of reducing cost and improving efficiency , How to make full use of existing resources , Improving resource utilization is the key . This article provides a very convenient upgrade Python Algorithm service performance , as well as CPU Utilization solution , And achieved good results online . besides , Tencent also has some other Python Acceleration program , For example, at present TEG The compilation optimization team of is doing Python The optimization of the interpreter , You can also look forward to it in the future .