[WIP] Add a mutual information CQM selector #16

juansebastianl wants to merge 1 commit into dwavesystems:main from
Conversation
```cython
cdef extern from "math.h":
    double log(double x) nogil
```
Suggested change:
```diff
-cdef extern from "math.h":
-    double log(double x) nogil
+from libc.math cimport log
```
You can see it here. If you're wondering how to find what Cython actually has already implemented... the answer is that I literally search the source files every. single. time. 😄
```cython
def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
```
In modern Cython, they want us to use Typed Memoryviews. So for instance here it would be
Suggested change:
```diff
-def calculate_mi(np.ndarray[np.float_t,ndim = 2] X,
+def calculate_mi(np.float_t[:, :] X,
```
```cython
                 np.ndarray[np.float_t,ndim = 1] y,
                 unsigned long refinement_factor = 5):
```
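One practical upside of typed memoryviews, illustrated here at the Python level rather than in the PR's own code: they accept any object exposing the buffer protocol, not just `np.ndarray`. Python's built-in `memoryview` gives a rough feel for this.

```python
import numpy as np
from array import array

# Typed memoryviews in Cython accept anything supporting the buffer
# protocol; the built-in memoryview demonstrates the same idea.
a = np.arange(4, dtype=np.float64)
b = array("d", [0.0, 1.0, 2.0, 3.0])

mv_a = memoryview(a)  # works: ndarray exposes a buffer
mv_b = memoryview(b)  # works: array.array does too

print(mv_a.format, mv_b.format)  # both "d" (C double)
```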
```cython
cdef unsigned long n_obs = X.shape[0]
```
In general, I have found it far less of a headache when going back and forth between NumPy and Cython to use fixed-width types everywhere. So
Suggested change:
```diff
-cdef unsigned long n_obs = X.shape[0]
+cdef np.uint64_t n_obs = X.shape[0]
```
Cython, because everything gets mediated through an intermediate generated file, tends to get confused about typing, and keeping everything fixed-width tends to help.
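A quick way to see why fixed-width types matter: a plain C `long` (what `cdef long`/`unsigned long` compile to) varies in size across platforms, while NumPy's fixed-width dtypes do not. A small check, not from the PR:

```python
import ctypes
import numpy as np

# C "long" is platform dependent: 4 bytes on 64-bit Windows,
# 8 bytes on most 64-bit Linux/macOS builds.
print(ctypes.sizeof(ctypes.c_long))   # 4 or 8, depending on platform

# Fixed-width NumPy dtypes are the same size everywhere.
print(np.dtype(np.int64).itemsize)    # always 8
print(np.dtype(np.uint64).itemsize)   # always 8
```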
```cython
        raise ValueError("y is the wrong shape")

    cdef long sub_index_size = refinement_factor*int(np.round(log(n_obs))) + 2
    cdef total_state_space_size = pow(sub_index_size,3)

cdef extern from "math.h":
    double log(double x) nogil
```
You'll get a pretty big performance boost (at the cost of safety) by using
```cython
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
```
See Cython for NumPy users, which is an excellent tutorial.
```cython
cdef assign_nn(np.ndarray[np.float_t,ndim = 2] X,
               long sub_index_size,
               short seed = 12121):
```
In general, it's preferred to make the default seed None.

Also, in this case you're better off not specifying a C type for seed, because it's only passed to a Python function, np.random.default_rng(seed), below. If it starts out as a C type, it has to be converted to a Python type first. Of course, in this case the performance hit is so minor as to be pointless... but the other advantage of using object here rather than short is that it supports None or another rng.
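A minimal Python sketch of the point being made here (the function name is hypothetical, standing in for the seeding logic inside assign_nn): leaving seed untyped means None, an int, or an existing Generator all work unchanged.

```python
import numpy as np

# Hypothetical sketch: np.random.default_rng accepts None (fresh
# entropy), an integer seed, or an existing Generator, which it
# returns unaltered.
def make_rng(seed=None):
    return np.random.default_rng(seed)

r1 = make_rng(12121).integers(0, 10, size=3)  # reproducible
r2 = make_rng(12121).integers(0, 10, size=3)  # same values again
g = np.random.default_rng(7)
r3 = make_rng(g)                              # caller-supplied rng passes through
print(r1, r2)
```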
```cython
        return query(query_number, query_list[:new_len], idx - new_idx_unit)
    else:
        #print("query : ", query_number, " pivot: ", pivot_value, " new index: ", idx + new_idx_unit)
        return query(query_number, query_list[new_len:], idx + new_idx_unit)
```
I think you may be able to get tail recursion if, instead of passing query_list[new_len:], you pass the index. query_list[new_len:] makes a new object, a view; it doesn't copy the underlying data, but it's still a new object being created.
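A simplified stand-in for the author's query function (the name query_by_index and the binary-search body are illustrative, not the PR's code) shows the index-passing pattern: each recursive call gets lo/hi bounds instead of a slice, so no per-call view objects are created.

```python
# Hypothetical sketch: recurse on indices into a fixed list rather
# than on slices of it, so no intermediate view objects are made.
def query_by_index(target, xs, lo, hi):
    # Binary search for the bin of `target` in sorted `xs[lo:hi]`.
    if hi - lo <= 1:
        return lo
    mid = (lo + hi) // 2
    if target < xs[mid]:
        return query_by_index(target, xs, lo, mid)   # left half, same list
    return query_by_index(target, xs, mid, hi)       # right half, same list

xs = [0.1, 0.4, 0.7, 0.9]
print(query_by_index(0.5, xs, 0, len(xs)))  # 1
```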
```cython
cdef long[::1] full_index = np.zeros(n_obs, dtype=long)

for i in range(n_obs):
    full_index[i] = i
```
Suggested change:
```diff
-cdef long[::1] full_index = np.zeros(n_obs, dtype=long)
-for i in range(n_obs):
-    full_index[i] = i
+cdef np.int64_t[::1] full_index = np.arange(n_obs, dtype=np.int64)
```
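At the NumPy level the two forms are equivalent, which a quick check confirms (plain Python here, without the Cython memoryview declaration):

```python
import numpy as np

n_obs = 6

# Filling an index array one element at a time in a loop...
full_index_loop = np.zeros(n_obs, dtype=np.int64)
for i in range(n_obs):
    full_index_loop[i] = i

# ...is the same as a single fixed-width arange call.
full_index = np.arange(n_obs, dtype=np.int64)

print((full_index == full_index_loop).all())  # True
```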
```python
install_requires=[
    "numpy",
],
ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
```
Suggested change:
```diff
-ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"]),
+ext_modules = cythonize(["./dwave/plugins/sklearn/nearest_neighbors.pyx"], annotate=True),
```
Cython annotations are super helpful. We do some stuff with environment variables in some of our other packages (e.g. dimod), but there's no harm in just turning it on always. At some point I'll get around to making a PR to dimod to just change that to always be True.
This pull request aims to add a mutual information feature selector by constructing a mutual information matrix whose diagonal elements are $I(Y;X_i)$ and whose off-diagonal elements are $I(Y;X_i,X_j)$. The matrix is estimated via a discrete approximation that is $O(m^2 \cdot (\log n)^3)$ in time and space complexity, where $n$ is the number of observations and $m$ is the number of features. I want to improve this, at least in space complexity, because much of the large matrix is sparse and memory quickly becomes a constraint; there are also still some numerical errors in this version.
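For intuition, here is a rough Python sketch of the kind of histogram-based discrete MI estimate described above. The function name, bin count, and binning scheme are illustrative only, not the PR's actual estimator.

```python
import numpy as np

# Rough plug-in MI estimate: bin a continuous feature, build a
# contingency table against a discrete y, and sum p * log(p / (px * py)).
def mutual_information(x, y, bins=8):
    # Assign each sample of x to a histogram bin.
    x_bins = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    # Joint counts over (x bin, y label).
    joint = np.zeros((x_bins.max() + 1, int(y.max()) + 1))
    for xb, yb in zip(x_bins, y.astype(int)):
        joint[xb, yb] += 1
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal over x bins
    py = p.sum(axis=0, keepdims=True)   # marginal over y labels
    nz = p > 0                          # avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x > 0).astype(int)          # y fully determined by the sign of x
print(mutual_information(x, y))  # well above 0; bounded by H(Y) = log 2
```

With finer bins the estimate approaches $H(Y) = \log 2$ for this fully dependent pair, while for independent variables it stays near zero (up to the usual positive plug-in bias).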