[WIP] Add a mutual information CQM selector #16

juansebastianl wants to merge 1 commit into dwavesystems:main from
Conversation
```cython
cdef extern from "math.h":
    double log(double x) nogil
```
Suggested change:
```diff
-cdef extern from "math.h":
-    double log(double x) nogil
+from libc.math cimport log
```
You can see it here. If you're wondering how to find what Cython actually has already implemented... the answer is that I literally search the source files every. single. time. 😄
```cython
def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
```
In modern Cython, they want us to use Typed Memoryviews. So for instance here it would be
Suggested change:
```diff
-def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
+def calculate_mi(np.float_t[:, :] X,
```
```cython
                 np.ndarray[np.float_t,ndim = 1] y,
                 unsigned long refinement_factor = 5):
```
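One practical upside of typed memoryviews, illustrated here at the Python level rather than in the PR's own code: they accept any object exposing the buffer protocol, not just `np.ndarray`. Python's built-in `memoryview` gives a rough feel for this.

```python
import numpy as np
from array import array

# Typed memoryviews in Cython accept anything supporting the buffer
# protocol; the built-in memoryview demonstrates the same idea.
a = np.arange(4, dtype=np.float64)
b = array("d", [0.0, 1.0, 2.0, 3.0])

mv_a = memoryview(a)  # works: ndarray exposes a buffer
mv_b = memoryview(b)  # works: array.array does too

print(mv_a.format, mv_b.format)  # both "d" (C double)
```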
```cython
cdef unsigned long n_obs = X.shape[0]
```
In general, I have found it far less of a headache when going back and forth between NumPy and Cython to use fixed-width types everywhere. So
Suggested change:
```diff
-cdef unsigned long n_obs = X.shape[0]
+cdef np.uint64_t n_obs = X.shape[0]
```
Cython, because everything gets mediated through an intermediate generated file, tends to get confused about typing, and keeping everything fixed-width tends to help.
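A quick way to see why fixed-width types matter: a plain C `long` (what `cdef long`/`unsigned long` compile to) varies in size across platforms, while NumPy's fixed-width dtypes do not. A small check, not from the PR:

```python
import ctypes
import numpy as np

# C "long" is platform dependent: 4 bytes on 64-bit Windows,
# 8 bytes on most 64-bit Linux/macOS builds.
print(ctypes.sizeof(ctypes.c_long))   # 4 or 8, depending on platform

# Fixed-width NumPy dtypes are the same size everywhere.
print(np.dtype(np.int64).itemsize)    # always 8
print(np.dtype(np.uint64).itemsize)   # always 8
```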
```cython
        raise ValueError("y is the wrong shape")

    cdef long sub_index_size = refinement_factor*int(np.round(log(n_obs))) + 2
    cdef total_state_space_size = pow(sub_index_size,3)

cdef extern from "math.h":
    double log(double x) nogil
```
You'll get a pretty big performance boost (at the cost of safety) by using
```cython
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
```
See Cython for NumPy users, which is an excellent tutorial.
```cython
cdef assign_nn(np.ndarray[np.float_t,ndim = 2] X,
               long sub_index_size,
               short seed = 12121):
```
In general, it's preferred to make the default seed None.

Also, in this case you're better off not specifying a C type for seed, because it's only passed to a Python function, np.random.default_rng(seed), below. If it starts out as a C type, it has to be converted to a Python type first. Of course, in this case the performance hit is so minor as to be pointless... but the other advantage of using object here rather than short is that it supports None or another rng.
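A minimal Python sketch of the point being made here (the function name is hypothetical, standing in for the seeding logic inside assign_nn): leaving seed untyped means None, an int, or an existing Generator all work unchanged.

```python
import numpy as np

# Hypothetical sketch: np.random.default_rng accepts None (fresh
# entropy), an integer seed, or an existing Generator, which it
# returns unaltered.
def make_rng(seed=None):
    return np.random.default_rng(seed)

r1 = make_rng(12121).integers(0, 10, size=3)  # reproducible
r2 = make_rng(12121).integers(0, 10, size=3)  # same values again
g = np.random.default_rng(7)
r3 = make_rng(g)                              # caller-supplied rng passes through
print(r1, r2)
```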
```cython
        return query(query_number, query_list[:new_len], idx - new_idx_unit)
    else:
        #print("query : ", query_number, " pivot: ", pivot_value, " new index: ", idx + new_idx_unit)
        return query(query_number, query_list[new_len:], idx + new_idx_unit)
```
I think you may be able to get tail recursion if, instead of passing query_list[new_len:], you pass the index. query_list[new_len:] makes a new object, a view; it doesn't copy the underlying data, but it's still a new object being created.
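A simplified stand-in for the author's query function (the name query_by_index and the binary-search body are illustrative, not the PR's code) shows the index-passing pattern: each recursive call gets lo/hi bounds instead of a slice, so no per-call view objects are created.

```python
# Hypothetical sketch: recurse on indices into a fixed list rather
# than on slices of it, so no intermediate view objects are made.
def query_by_index(target, xs, lo, hi):
    # Binary search for the bin of `target` in sorted `xs[lo:hi]`.
    if hi - lo <= 1:
        return lo
    mid = (lo + hi) // 2
    if target < xs[mid]:
        return query_by_index(target, xs, lo, mid)   # left half, same list
    return query_by_index(target, xs, mid, hi)       # right half, same list

xs = [0.1, 0.4, 0.7, 0.9]
print(query_by_index(0.5, xs, 0, len(xs)))  # 1
```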
```cython
cdef long[::1] full_index = np.zeros(n_obs, dtype=long)

for i in range(n_obs):
    full_index[i] = i
```
Suggested change:
```diff
-cdef long[::1] full_index = np.zeros(n_obs, dtype=long)
-for i in range(n_obs):
-    full_index[i] = i
+cdef np.int64_t[::1] full_index = np.arange(n_obs, dtype=np.int64)
```
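At the NumPy level the two forms are equivalent, which a quick check confirms (plain Python here, without the Cython memoryview declaration):

```python
import numpy as np

n_obs = 6

# Filling an index array one element at a time in a loop...
full_index_loop = np.zeros(n_obs, dtype=np.int64)
for i in range(n_obs):
    full_index_loop[i] = i

# ...is the same as a single fixed-width arange call.
full_index = np.arange(n_obs, dtype=np.int64)

print((full_index == full_index_loop).all())  # True
```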
```python
install_requires=[
    "numpy",
],
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
```
Suggested change:
```diff
-ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
+ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"], annotate=True),
```
Cython annotations are super helpful. We do some stuff with environment variables in some of our other packages (e.g. dimod), but there's no harm in just turning it on always. At some point I'll get around to making a PR to dimod to just change that to always be True.
This pull request aims to add a mutual information feature selector by constructing a mutual information matrix whose diagonal elements are $I(Y;X_i)$ and whose off-diagonal elements are $I(Y;X_i,X_j)$. The matrix is estimated via a discrete approximation that is $O(m^2 \cdot (\log n)^3)$ in time and space complexity, where $n$ is the number of observations and $m$ is the number of features. I want to improve this, at least in space complexity, because much of the large matrix is sparse and memory quickly becomes a constraint; there are also still some numerical errors in this version.
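For intuition, here is a rough Python sketch of the kind of histogram-based discrete MI estimate described above. The function name, bin count, and binning scheme are illustrative only, not the PR's actual estimator.

```python
import numpy as np

# Rough plug-in MI estimate: bin a continuous feature, build a
# contingency table against a discrete y, and sum p * log(p / (px * py)).
def mutual_information(x, y, bins=8):
    # Assign each sample of x to a histogram bin.
    x_bins = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    # Joint counts over (x bin, y label).
    joint = np.zeros((x_bins.max() + 1, int(y.max()) + 1))
    for xb, yb in zip(x_bins, y.astype(int)):
        joint[xb, yb] += 1
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over x bins
    py = p.sum(axis=0, keepdims=True)   # marginal over y labels
    nz = p > 0                          # avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x > 0).astype(int)          # y fully determined by the sign of x
print(mutual_information(x, y))  # well above 0; bounded by H(Y) = log 2
```

With finer bins the estimate approaches $H(Y) = \log 2$ for this fully dependent pair, while for independent variables it stays near zero (up to the usual positive plug-in bias).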