PCovC takes forever to fit on sparse fingerprint without data scaling #277

@Senpoo009

X_train holds 2048-bit RDKit fingerprints:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(5807, 2048))

Fitting PCovC to this feature array takes around 5-8 minutes to run.

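from sklearn.linear_model import LogisticRegression
from skmatter.decomposition import PCovC

# Fit the classifier on the raw fingerprints, then pass it to PCovC.
lr = LogisticRegression()
lr.fit(X_train, y_train)

pcovc = PCovC(mixing=0.05, classifier=lr, n_components=2)
pcovc.fit(X_train, y_train)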

Next, X_train is transformed with StandardScaler.fit_transform to give X_train_scaled:

array([[-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 , -0.92512945],
       [-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 , -0.92512945],
       [-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 , -0.92512945],
       ...,
       [-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 ,  1.08092981],
       [-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 , -0.92512945],
       [-0.59651099, -0.46626218, -0.49935414, ..., -0.34791159,
        -0.4461504 , -0.92512945]], shape=(5807, 2048))
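
For reference, the scaling step can be sketched as below. This is only a minimal sketch: `X_demo` is a small random 0/1 array standing in for the real fingerprints, and its size, bit density, and the names `rng`, `X_demo`, and `X_demo_scaled` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real fingerprints: a small random 0/1 array.
rng = np.random.default_rng(0)
X_demo = (rng.random((200, 64)) < 0.2).astype(np.int64)

# fit_transform centers each column to mean 0 and scales it to unit variance.
scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# A bit column still takes only two distinct values after scaling,
# which matches the repeated values in the scaled array shown above.
n_distinct = len(np.unique(X_demo_scaled[:, 0]))
```

After scaling, the array is dense floating point and no longer contains exact zeros, so any difference in fit time is between two arrays of the same shape but very different value distributions.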

Fitting PCovC to this scaled feature array takes around 15 seconds to run.

lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)

pcovc = PCovC(mixing=0.05, classifier=lr, n_components=2)
pcovc.fit(X_train_scaled, y_train)

It seems strange that the unscaled fingerprint would take so much longer than the scaled fingerprint to fit. What could be causing this?
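
To help narrow this down, each fit call could be timed separately on the raw and scaled arrays. The helper below is a minimal sketch, not part of scikit-matter: `timed_fit` is a hypothetical name, and the demo data is random and much smaller than the real fingerprints. The same helper could presumably be called with the `PCovC` object in place of `LogisticRegression` to see which step dominates.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def timed_fit(estimator, X, y):
    """Fit `estimator` on (X, y) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


# Hypothetical stand-in data: random bits and random binary labels.
rng = np.random.default_rng(0)
X_demo = (rng.random((300, 32)) < 0.2).astype(float)
y_demo = rng.integers(0, 2, size=300)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Time the same kind of estimator on the raw and on the scaled features.
t_raw = timed_fit(LogisticRegression(), X_demo, y_demo)
t_scaled = timed_fit(LogisticRegression(), X_demo_scaled, y_demo)
print(f"raw: {t_raw:.3f} s, scaled: {t_scaled:.3f} s")
```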
