Native Compilation of Numba on M1-based Mac

2022-11-30
7 min read

This is cross-published on my HackMD

Motivation

The current installation instructions for Numba run into some difficulties induced by the way Apple packages its software development kit (SDK). This note documents how to set up Numba from start to finish on an M1-based Mac, and where alternative approaches fail. We build with OpenMP support to utilize all available threads, but without Threading Building Blocks (TBB), which is unsupported on Mac, and likewise without CUDA, which is also not available on Mac.

Building Numba from Source for Local Development with OpenMP Support

To build Numba on Mac (or any platform, for that matter) we begin by creating a conda environment with the base dependencies

conda create -n numbaenv python=3.10 numba/label/dev::llvmlite numpy scipy jinja2 cffi

and activate the environment

conda activate numbaenv

At this point we have installed llvmlite, the binding to the LLVM JIT engine underpinning Numba, whose installation we can now verify from a Python REPL.

import llvmlite
llvmlite.__version__

We can now clone the source of Numba.

git clone git@github.com:numba/numba.git && cd numba

Preemptively disable Threading Building Blocks

export NUMBA_DISABLE_TBB=1

Point to the Mac-specific software development kit (SDK), the path to which can be found with xcrun --show-sdk-path

export SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

The important hint to set SDKROOT was found in Numba's conda build scripts; downloading the older SDK, as done in that build script, is not necessary on a modern Mac if the SDK headers are already installed.

Next, install the macOS-specific Clang compilers provided by conda, so that we do not use Apple's system compiler, which ships with OpenMP disabled and rejects the -fopenmp flag at compile time.

conda install clang clangdev
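As a quick sanity check (a sketch, not part of the official instructions; the helper name is arbitrary), we can probe whether the clang now on our PATH accepts -fopenmp by compiling a trivial OpenMP program:

```python
import os
import shutil
import subprocess
import tempfile
import textwrap

# Trivial OpenMP program: each thread prints its id.
SRC = textwrap.dedent("""\
    #include <omp.h>
    #include <stdio.h>
    int main(void) {
        #pragma omp parallel
        printf("thread %d of %d\\n", omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }
""")

def clang_supports_openmp() -> bool:
    """Return True if the clang on PATH can compile an OpenMP program."""
    if shutil.which("clang") is None:
        return False  # no clang on PATH at all
    with tempfile.TemporaryDirectory() as tmp:
        c_file = os.path.join(tmp, "omp_check.c")
        with open(c_file, "w") as f:
            f.write(SRC)
        # Apple's system clang fails here with "unsupported option '-fopenmp'".
        result = subprocess.run(
            ["clang", "-fopenmp", c_file, "-o", os.path.join(tmp, "omp_check")],
            capture_output=True,
        )
        return result.returncode == 0

print(clang_supports_openmp())
```

With the conda-provided clang active in the environment this should print True; with Apple's system clang it prints False.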

The command to build Numba itself for development purposes, with --noopt (no optimizations) and --debug (debugging options enabled), is then

python setup.py build_ext --inplace --noopt --debug

If we just want to use it locally, without the development-focused deactivation of optimizations, we build with only the --inplace option

python setup.py build_ext --inplace

After which we can install Numba into our environment as an editable install

python -m pip install --no-deps -e .

And verify its installation either from the REPL with

import numba
numba.__version__

or from the command line with Numba’s provided utility

numba -s

The output should look similar to this

System info:
--------------------------------------------------------------------------------
__Time Stamp__
Report started (local time)                   : 2022-11-26 23:48:15.001253
UTC start time                                : 2022-11-27 05:48:15.001264
Running time (s)                              : 0.718704

__Hardware Information__
Machine                                       : arm64
CPU Name                                      : cyclone
CPU Count                                     : 10
Number of accessible CPUs                     : ?
List of accessible CPUs cores                 : ?
CFS Restrictions (CPUs worth of runtime)      : None

CPU Features                                  : 

Memory Total (MB)                             : 32768
Free Memory (MB)                              : 23

__OS Information__
Platform Name                                 : macOS-13.0.1-arm64-arm-64bit
Platform Release                              : 22.1.0
OS Name                                       : Darwin
OS Version                                    : Darwin Kernel Version 22.1.0: Sun Oct  9 20:15:09 PDT 2022; root:xnu-8792.41.9~2/RELEASE_ARM64_T6000
OS Specific Version                           : 13.0.1   arm64
Libc Version                                  : ?

__Python Information__
Python Compiler                               : Clang 14.0.6 
Python Implementation                         : CPython
Python Version                                : 3.10.8
Python Locale                                 : None.UTF-8

__Numba Toolchain Versions__
Numba Version                                 : 0.57.0dev0+846.g728263512
llvmlite Version                              : 0.40.0dev0+43.g7783803

__LLVM Information__
LLVM Version                                  : 11.1.0

__CUDA Information__
CUDA Device Initialized                       : False
CUDA Driver Version                           : ?
CUDA Runtime Version                          : ?
CUDA NVIDIA Bindings Available                : ?
CUDA NVIDIA Bindings In Use                   : ?
CUDA Detect Output:
None
CUDA Libraries Test Output:
None

__NumPy Information__
NumPy Version                                 : 1.23.4
NumPy Supported SIMD features                 : ('NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD', 'FPHP', 'ASIMDHP', 'ASIMDDP')
NumPy Supported SIMD dispatch                 : ('ASIMDHP', 'ASIMDDP', 'ASIMDFHM')
NumPy Supported SIMD baseline                 : ('NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD')
NumPy AVX512_SKX support detected             : False

__SVML Information__
SVML State, config.USING_SVML                 : False
SVML Library Loaded                           : False
llvmlite Using SVML Patched LLVM              : True
SVML Operational                              : False

__Threading Layer Information__
TBB Threading Layer Available                 : False
+--> Disabled due to Unknown import problem.
OpenMP Threading Layer Available              : True
+-->Vendor: Intel
Workqueue Threading Layer Available           : True
+-->Workqueue imported successfully.

__Numba Environment Variable Information__
None found.

__Conda Information__
Conda Build                                   : 3.21.9
Conda Env                                     : 4.13.0
Conda Platform                                : osx-arm64
Conda Python Version                          : 3.9.11.final.0
Conda Root Writable                           : True

__Installed Packages__
blas                      1.0                    openblas  
bzip2                     1.0.8                h620ffc9_4  
ca-certificates           2022.10.11           hca03da5_0  
certifi                   2022.9.24       py310hca03da5_0  
cffi                      1.15.1          py310h80987f9_2  
clang                     14.0.6               hca03da5_0  
clang-14                  14.0.6          default_hf5194b7_0  
clang-format              14.0.6          default_hf5194b7_0  
clang-format-14           14.0.6          default_hf5194b7_0  
clang-tools               14.0.6          default_hf5194b7_0  
clangdev                  14.0.6          default_hf5194b7_0  
clangxx                   14.0.6          default_hf5194b7_0  
fftw                      3.3.9                h1a28f6b_1  
jinja2                    3.1.2           py310hca03da5_0  
libclang                  14.0.6          default_hf5194b7_0  
libclang-cpp              14.0.6          default_hf5194b7_0  
libclang-cpp14            14.0.6          default_hf5194b7_0  
libclang13                14.0.6          default_hf5a4b0a_0  
libcxx                    14.0.6               h848a8c0_0  
libffi                    3.4.2                hca03da5_6  
libgfortran               5.0.0           11_3_0_hca03da5_28  
libgfortran5              11.3.0              h009349e_28  
libllvm14                 14.0.6               h7ec7a93_1  
libopenblas               0.3.21               h269037a_0  
llvm-openmp               14.0.6               hc6e5704_0  
llvm-tools                14.0.6               h7ec7a93_1  
llvmdev                   14.0.6               h7ec7a93_1  
llvmlite                  0.40.0dev0             py310_43    numba/label/dev
markupsafe                2.1.1           py310h1a28f6b_0  
ncurses                   6.3                  h1a28f6b_3  
numba                     0.57.0.dev0+846.g728263512           dev_0    <develop>
numpy                     1.23.4          py310hb93e574_0  
numpy-base                1.23.4          py310haf87e8b_0  
openssl                   1.1.1s               h1a28f6b_0  
pip                       22.2.2          py310hca03da5_0  
pycparser                 2.21               pyhd3eb1b0_0  
python                    3.10.8               hc0d8a6c_1  
readline                  8.2                  h1a28f6b_0  
scipy                     1.9.3           py310h20cbe94_0  
setuptools                65.5.0          py310hca03da5_0  
sqlite                    3.40.0               h7a7dc30_0  
tk                        8.6.12               hb8d0fd4_0  
tzdata                    2022f                h04d1e81_0  
wheel                     0.37.1             pyhd3eb1b0_0  
xz                        5.2.6                h1a28f6b_0  
zlib                      1.2.13               h5a0b063_0  

No errors reported.


__Warning log__
Warning (cuda): CUDA driver library cannot be found or no CUDA enabled devices are present.
Exception class: <class 'numba.cuda.cudadrv.error.CudaSupportError'>
Warning (psutil): psutil cannot be imported. For more accuracy, consider installing it.
--------------------------------------------------------------------------------

Approaches Not to Take (for Mental Sanity)

There are multiple approaches I attempted but ultimately abandoned, due to their instability as well as the inherent pitfalls of pulling headers from multiple directories with similar contents at the same time.

Building LLVM from Source for the Compiler

To circumvent the limitations of the native Clang compiler on Mac, we can try to build our own LLVM toolchain. As llvmlite is based on a patched version of the release/14.x branch of LLVM, checking out the LLVM source on that branch is the most natural option

git clone https://github.com/llvm/llvm-project && cd llvm-project
git checkout release/14.x
mkdir build && cd build

We then configure the Clang toolchain for the M1-based Mac with the OpenMP libraries

cmake -G Ninja ../llvm -DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;lld;openmp" \
-DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi" \
-DLLVM_LINK_LLVM_DYLIB=ON \
-DLLVM_ENABLE_EH=ON \
-DLLVM_ENABLE_FFI=ON \
-DLLVM_ENABLE_RTTI=ON \
-DLLVM_INCLUDE_DOCS=OFF \
-DLLVM_INSTALL_UTILS=ON \
-DLLVM_ENABLE_Z3_SOLVER=OFF \
-DLLVM_TARGETS_TO_BUILD="AArch64" \
-DLIBOMP_INSTALL_ALIASES=OFF \
-DLLVM_CREATE_XCODE_TOOLCHAIN=ON \
-DLLVM_BUILD_LLVM_C_DYLIB=ON \
-DLLVM_ENABLE_LIBCXX=ON \
-DRUNTIMES_CMAKE_ARGS=-DCMAKE_INSTALL_RPATH="@loader_path/../lib"

and, once configured, run the (lengthy) build with

ninja

We then have to prepend the new binaries to $PATH, and the libraries to $DYLD_LIBRARY_PATH (macOS's dynamic linker ignores $LD_LIBRARY_PATH), for them to be found before we can continue the attempt to build Numba with this toolchain

export PATH=${PWD}/bin:$PATH
export DYLD_LIBRARY_PATH=${PWD}/lib:$DYLD_LIBRARY_PATH

So where does this begin to fail?

  • At first we are missing the stdio.h header, which we can get from the Xcode developer SDK by manually pointing to its header directory.
  • Where this approach really collapses is the next step, where we run into conflicting versions of headers, which would require extensive manual include-path work to keep the compiler from searching entire directories of headers.

Python Virtualenv-based Install

Another seemingly logical approach would be to use a Python virtual environment, and hence avoid the conda-induced isolation with conda's own libraries in favor of a very thin virtualenv. To do this we'd

python3 -m venv numbaenv && source numbaenv/bin/activate
pip install llvmlite

at which point we can clone the Numba repo and start a CPU-based build, which, for lack of toolchain support, can only run Numba's own workqueue threads, without Threading Building Blocks or OpenMP

python setup.py build_ext --inplace --noopt --debug

So where does this particular approach begin to fail?

  • The Numba library built from main is incompatible with the llvmlite version shipped on PyPI, so the two do not work together.

Setting CFLAGS and LDFLAGS as Suggested on Stack Overflow

A typical suggestion on Stack Overflow and similar sites is to set compiler and linker flags at the command line to fix the compiler not finding certain header files, and to set a deployment target, i.e.

export CFLAGS="-I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include"
export LDFLAGS="-L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/lib"
export MACOSX_DEPLOYMENT_TARGET=14.1

So why is this approach undesirable, and why does it ultimately fail?

  • We end up patching over the first missing headers such as stdio.h, but the next missing header is then vector.h, and the deluge of missing headers does not seem to stop. As such I would call this an unclean approach, which I ultimately did not manage to get to work.