at present AI Algorithm development, especially training, is basically based on Python Mainly , Mainstream AI Calculation framework, such as TensorFlow、PyTorch And so on Python Interface . There is a good saying , Life is too short , I use Python. But because of Python Dynamic language , Explain execution and lack of maturity JIT programme , In computing intensive scenarios, multi-core concurrency is limited , It is difficult to directly meet the real-time requirements of high performance Serving demand . In some scenarios with high performance requirements , Still need to use C/C++ To solve . But if the algorithm students are required to use it all C++ To develop online reasoning services , And the cost is very high , Lead to development efficiency and waste of resources . therefore , If there is a light way to Python And part C++ Combined with the core code written , It can achieve the effect of ensuring both development efficiency and service performance . This paper mainly introduces pybind11 In Tencent advertising multimedia AI Python Accelerated practice of algorithm , And some experience summary in the process .
Python The official provided Python/C API, Can achieve 「 use C Language writing Python library 」, Let's start with the last piece of code and feel :
static PyObject * spam_system(PyObject *self, PyObject *args) { const char *command; int sts; if (!PyArg_ParseTuple(args, "s", &command)) return NULL; sts = system(command); return PyLong_FromLong(sts); }
It can be seen that the transformation cost is very high , All basic types must be manually changed to CPython Interpreter encapsulated binding type . It's not hard to understand , why Python The official website also recommends that you use third-party solutions 1.
Cython The main connection is Python and C, Convenient for Python To write C Expand .Cython The compiler supports conversion Python The code is C Code , these C The code can call Python/C Of API. essentially ,Cython That is to say, it contains C Data type Python. at present Python Of numpy, And our factory tRPC-Python The framework is applied .
shortcoming :
SIWG It mainly solves other high-level languages and C and C++ The problem of language interaction , Supports more than a dozen programming languages , Including common java、C#、javascript、Python etc. . You need to use *.i File definition interface , Then use tools to generate cross language interaction code . However, due to the large number of languages supported , So in Python The end performance is not very good .
It is worth mentioning that ,TensorFlow Early also used SWIG To encapsulate Python Interface , Formally due to SIWG The performance is not good enough 、 Build complex 、 The binding code is obscure and difficult to read ,TensorFlow Has been in 2019 The year will be SIWG Switch to a pybind112.
C++ Widely used in Boost Open source library , Also provided Python binding function . On use , Simplify through macro definition and metaprogramming Python Of API call . But the biggest disadvantage is the need to rely on huge Boost library , The burden of compilation and dependencies is heavy , Only for solving Python binding If so, there is a visual sense of anti-aircraft shelling mosquitoes .
It can be understood as taking Boost.Python As a blueprint , Available only Python & C++ binding A compact version of the function , be relative to Boost.Python stay binary size And compilation speed has many advantages . Yes C++ Support is very good , be based on C++11 Various new features have been applied , Maybe pybind11 The suffix 11 That's why .
Pybind11 adopt C++ Compile time introspection to infer type information , To minimize traditional expansion Python Module is a complex template code , And implements common data types , Such as STL data structure 、 Intelligent pointer 、 class 、 function overloading 、 Example method Python Automatic conversion of , Functions can receive and return values of custom data types 、 Pointer or reference .
characteristic :
“Talk is cheap, show me your code.” Three lines of code can quickly implement binding , You deserve it :
PYBIND11_MODULE (libcppex, m) { m.def("add", [](int a, int b) -> int { return a + b; }); }
GIL(Global Interpreter Lock) Global interpreter lock : Only one thread is allowed to use the interpreter in a process at the same time , As a result, multithreading cannot really use multi-core . Because the thread holding the lock is executing to I/O Some wait operations such as dense functions will be released automatically GIL lock , So for I/O For intensive services , Multithreading is effective . But for the CPU Intensive operation , Because only one thread can actually perform calculation at a time , The impact on performance can be imagined .
It must be said that ,GIL Not at all Python Its own flaws , But now Python Default CPython Thread safety protection lock introduced by parser . We generally say Python There is GIL lock , In fact, only for CPython Interpreter . So if we can find a way to avoid GIL lock , Will it have a good acceleration effect ? The answer is yes , One solution is to use other interpreters instead, such as pypy etc. , But for mature C The compatibility of extension libraries is not good enough , Maintenance costs are high . Another option , It is through C/C++ Extensions to encapsulate compute intensive code , And remove... On execution GIL lock .
pybind11 It provides in C++ End manual release GIL The interface of the lock , therefore , We just need to put the part of the code that is computationally intensive , Transform into C++ Code , And release... Before and after execution / obtain GIL lock ,Python The multi-core computing power of the algorithm is unlocked . Of course , In addition to displaying the release of the calling interface GIL In addition to the method of locking , It can also be in C++ Internally switch computing intensive code to other C++ Threads execute asynchronously , It can also avoid GIL The lock utilizes multi-core .
Let's say 100 Take the calculation of spherical distance between 10000 cities as an example , contrast C++ Performance difference before and after expansion :
C++ End :
#include <math.h> #include <stdio.h> #include <time.h> #include <pybind11/embed.h> namespace py = pybind11; const double pi = 3.1415926535897932384626433832795; double rad(double d) { return d * pi / 180.0; } double geo_distance(double lon1, double lat1, double lon2, double lat2, int test_cnt) { py::gil_scoped_release release; // Release GIL lock double a, b, s; double distance = 0; for (int i = 0; i < test_cnt; i++) { double radLat1 = rad(lat1); double radLat2 = rad(lat2); a = radLat1 - radLat2; b = rad(lon1) - rad(lon2); s = pow(sin(a/2),2) + cos(radLat1) * cos(radLat2) * pow(sin(b/2),2); distance = 2 * asin(sqrt(s)) * 6378 * 1000; } py::gil_scoped_acquire acquire; // C++ Resume before execution ends GIL lock return distance; } PYBIND11_MODULE (libcppex, m) { m.def("geo_distance", &geo_distance, R"pbdoc( Compute geography distance between two places. )pbdoc"); }
Python Calling end :
import sys import time import math import threading from libcppex import * def rad(d): return d * 3.1415926535897932384626433832795 / 180.0 def geo_distance_py(lon1, lat1, lon2, lat2, test_cnt): distance = 0 for i in range(test_cnt): radLat1 = rad(lat1) radLat2 = rad(lat2) a = radLat1 - radLat2 b = rad(lon1) - rad(lon2) s = math.sin(a/2)**2 + math.cos(radLat1) * math.cos(radLat2) * math.sin(b/2)**2 distance = 2 * math.asin(math.sqrt(s)) * 6378 * 1000 print(distance) return distance def call_cpp_extension(lon1, lat1, lon2, lat2, test_cnt): res = geo_distance(lon1, lat1, lon2, lat2, test_cnt) print(res) return res if __name__ == "__main__": threads = [] test_cnt = 1000000 test_type = sys.argv[1] thread_cnt = int(sys.argv[2]) start_time = time.time() for i in range(thread_cnt): if test_type == 'p': t = threading.Thread(target=geo_distance_py, args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,)) elif test_type == 'c': t = threading.Thread(target=call_cpp_extension, args=(113.973129, 22.599578, 114.3311032, 22.6986848, test_cnt,)) threads.append(t) t.start() for thread in threads: thread.join() print('calc time = %d' % int((time.time() - start_time) * 1000))
Performance comparison :
Conclusion :
Compute intensive code , Simply change to C++ Achieve good performance improvement , Release in multithreading GIL Under the blessing of the lock , Make full use of multicore , Performance easily get linear speedup , Greatly improve the utilization of resources . Although it can also be used in actual scenes Python Multi process approach to take advantage of multi-core , But when the model is getting bigger and bigger, it is often dozens of G Under the trend of , Too much memory is used, don't say , Frequent switching between processes context switching overhead, And the performance differences of the language itself , Cause and C++ There are still many gaps in the expansion mode .
( notes : Above tests demo github Address :https://github.com/jesonxiang/cpp_extension_pybind11, The test environment is CPU 10 Nuclear vessel , If you are interested, you can also do performance verification .)
Compile instructions :
g++ -Wall -shared -std=gnu++11 -O2 -fvisibility=hidden -fPIC -I./ perfermance.cc -o libcppex.so `Python3-config --cflags --ldflags --libs`
If Python The environment is not configured correctly, and an error may be reported :
Here to Python Our dependence is through Python3-config --cflags --ldflags --libs To automatically specify , You can run this command alone to verify Python Whether the dependency is configured correctly .Python3-config Normal execution depends on Python3-dev, You can install it with the following command :
yum install Python3-devel
commonly pybind11 Are used to give C++ Code encapsulation Python End interface , But the other way around C++ transfer Python Also supportive . just #include <pybind11/embed.h> The header file is ready to use , The interior is embedded CPython Interpreter to implement . It is also very simple and easy to use , At the same time, it has good readability , With direct call Python The interface is very similar . For example, to a numpy Array calls some methods , Reference examples are as follows :
// C++ pyVec = pyVec.attr("transpose")().attr("reshape")(pyVec.size());
# Python pyVec = pyVec.transpose().reshape(pyVec.size)
The following is based on our C++ GPU High performance frame extraction so For example , In addition to providing frame extraction interface to Python End calls , You need to call back to Python So as to inform the frame drawing progress and frame data .
Python End callback interface :
def on_decoding_callback(task_id:str, progress:int): print("decoding callback, task id: %s, progress: %d" % (task_id, progress)) if __name__ == "__main__": decoder = DecoderWrapper() decoder.register_py_callback(os.getcwd() + "/decode_test.py", "on_decoding_callback")
C++ End interface registration & Callback Python:
#include <pybind11/embed.h> int DecoderWrapper::register_py_callback(const std::string &py_path, const std::string &func_name) { int ret = 0; const std::string &pyPath = py_get_module_path(py_path); const std::string &pyName = py_get_module_name(py_path); SoInfo("get py module name: %s, path: %s", pyName.c_str(), pyPath.c_str()); py::gil_scoped_acquire acquire; py::object sys = py::module::import("sys"); sys.attr("path").attr("append")(py::str(pyPath.c_str())); //Python The path of the script py::module pyModule = py::module::import(pyName.c_str()); if (pyModule == NULL) { LogError("Failed to load pyModule .."); py::gil_scoped_release release; return PYTHON_FILE_NOT_FOUND_ERROR; } if (py::hasattr(pyModule, func_name.c_str())) { py_callback = pyModule.attr(func_name.c_str()); } else { ret = PYTHON_FUNC_NOT_FOUND_ERROR; } py::gil_scoped_release release; return ret; } int DecoderListener::on_decoding_progress(std::string &task_id, int progress) { if (py_callback != NULL) { try { py::gil_scoped_acquire acquire; py_callback(task_id, progress); py::gil_scoped_release release; } catch (py::error_already_set const &PythonErr) { LogError("catched Python exception: %s", PythonErr.what()); } catch (const std::exception &e) { LogError("catched exception: %s", e.what()); } catch (...) { LogError("catched unknown exception"); } } }
For classes and member functions binding, First, you need to construct objects , So there are two steps : The first step is the construction method of packaging instance , Another step is to register the access method of member functions . meanwhile , Also support passing def_static、def_readwrite To bind static methods or member variables , For details, please refer to the official documents 3.
#include <pybind11/pybind11.h> class Hello { public: Hello(){} void say( const std::string s ){ std::cout << s << std::endl; } }; PYBIND11_MODULE(py2cpp, m) { m.doc() = "pybind11 example"; pybind11::class_<Hello>(m, "Hello") .def(pybind11::init()) // Constructors , Corresponding c++ Class constructor , If there is no declaration or the parameters are wrong , Will cause the call to fail .def( "say", &Hello::say ); } /* Python Call mode : c = py2cpp.Hello() c.say() */
pybind11 Support STL Automatic container conversion , When it needs to be handled STL When the container , Just include additional header files <pybind11/stl.h> that will do .pybind11 The automatic conversion provided includes :std::vector<>/std::list<>/std::array<> convert to Python list ;std::set<>/std::unordered_set<> convert to Python set ; std::map<>/std::unordered_map<> convert to dict etc. . Besides std::pair<> and std::tuple<> The transformation of is also <pybind11/pybind11.h> The header file provides .
#include <iostream> #include <pybind11/pybind11.h> #include <pybind11/stl.h> class ContainerTest { public: ContainerTest() {} void Set(std::vector<int> v) { mv = v; } private: std::vector<int> mv; }; PYBIND11_MODULE( py2cpp, m ) { m.doc() = "pybind11 example"; pybind11::class_<ContainerTest>(m, "CTest") .def( pybind11::init() ) .def( "set", &ContainerTest::Set ); } /* Python Call mode : c = py2cpp.CTest() c.set([1,2,3]) */
Because in Python3 in string The default type is UTF-8 code , If from C++ End transport string Type of protobuf Data to Python, It will appear “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte” The error of .
Solution :pybind11 Provides non text data binding type py::bytes:
m.def("return_bytes", []() { std::string s("\xba\xd0\xba\xd0"); // Not valid UTF-8 return py::bytes(s); // Return the data without transcoding } );
std::unique_ptr<Example> create_example() { return std::unique_ptr<Example>(new Example()); } m.def("create_example", &create_example);
class Child { }; class Parent { public: Parent() : child(std::make_shared<Child>()) { } Child *get_child() { return child.get(); } /* Hint: ** DON'T DO THIS ** */ private: std::shared_ptr<Child> child; }; PYBIND11_MODULE(example, m) { py::class_<Child, std::shared_ptr<Child>>(m, "Child"); py::class_<Parent, std::shared_ptr<Parent>>(m, "Parent") .def(py::init<>()) .def("get_child", &Parent::get_child); }
The frame extraction result is returned to Python End time , Due to the present pybind11 Automatic conversion is not supported at present cv::Mat data structure , Therefore, it needs to be handled manually C++ cv::Mat and Python End numpy Binding between . The conversion code is as follows :
/* Python->C++ Mat */ cv::Mat numpy_uint8_3c_to_cv_mat(py::array_t<uint8_t>& input) { if (input.ndim() != 3) throw std::runtime_error("3-channel image must be 3 dims "); py::buffer_info buf = input.request(); cv::Mat mat(buf.shape[0], buf.shape[1], CV_8UC3, (uint8_t*)buf.ptr); return mat; } /* C++ Mat ->numpy */ py::array_t<uint8_t> cv_mat_uint8_3c_to_numpy(cv::Mat& input) { py::array_t<uint8_t> dst = py::array_t<uint8_t>({ input.rows,input.cols,3}, input.data); return dst; }
Generally speaking, cross language calls produce performance overhead, Especially for the transmission of large data blocks . therefore ,pybind11 It also supports the way of data address transmission , Avoid the copy operation of large data blocks in memory , The performance is greatly improved .
class Matrix { public: Matrix(size_t rows, size_t cols) : m_rows(rows), m_cols(cols) { m_data = new float[rows*cols]; } float *data() { return m_data; } size_t rows() const { return m_rows; } size_t cols() const { return m_cols; } private: size_t m_rows, m_cols; float *m_data; }; py::class_<Matrix>(m, "Matrix", py::buffer_protocol()) .def_buffer([](Matrix &m) -> py::buffer_info { return py::buffer_info( m.data(), /* Pointer to buffer */ sizeof(float), /* Size of one scalar */ py::format_descriptor<float>::format(), /* Python struct-style format descriptor */ 2, /* Number of dimensions */ { m.rows(), m.cols() }, /* Buffer dimensions */ { sizeof(float) * m.cols(), /* Strides (in bytes) for each index */ sizeof(float) } ); });
The above plan , We have been advertising multimedia AI Color extraction related services 、GPU High performance frame extraction and other algorithms , Achieved a very good acceleration effect . In the industry , Most of them are on the market at present AI Computing framework , Such as TensorFlow、Pytorch、 Ali X-Deep Learning、 Baidu PaddlePaddle etc. , Both use pybind11 To provide C++ To Python End interface encapsulation , Its stability and performance have been widely verified .
stay AI In this field, it is common to increase revenue and reduce expenditure 、 Under the background of reducing cost and improving efficiency , How to make full use of existing resources , Improving resource utilization is the key . This article provides a very convenient upgrade Python Algorithm service performance , as well as CPU Utilization solution , And achieved good results online . besides , Tencent also has some other Python Acceleration program , For example, at present TEG The compilation optimization team of is doing Python The optimization of the interpreter , You can also look forward to it in the future .
1(https://docs.Python.org/3/extending/index.html#extending-index)
2(https://github.com/tensorflow/community/blob/master/rfcs/20190208-pybind11.md#replace-swig-with-pybind11)
3(https://pybind11.readthedocs.io/en/stable/advanced/cast/index.html)