Questions tagged [scikit-learn]
scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody, reusable in various contexts, and built on NumPy and SciPy. The project is open source and commercially usable (BSD license).
28,326
questions
324
votes
25
answers
390k
views
Label encoding across multiple columns in scikit-learn
I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'...
317
votes
16
answers
1.1m
views
How to normalize a numpy array to a unit vector
I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if ...
272
votes
14
answers
538k
views
Is there a library function for Root mean square error (RMSE) in python?
I know I could implement a root mean squared error function like this:
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
What I'm looking for is whether this rmse ...
267
votes
27
answers
828k
views
sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error.
ValueError: Input contains NaN, infinity or a value too ...
264
votes
15
answers
497k
views
ImportError: No module named sklearn.cross_validation
I am using Python 2.7 on Ubuntu 14.04. I installed scikit-learn, numpy and matplotlib with these commands:
sudo apt-get install build-essential python-dev python-numpy \
python-numpy-dev python-...
258
votes
7
answers
163k
views
Save classifier to disk in scikit-learn
How do I save a trained Naive Bayes classifier to disk and use it to predict data?
I have the following sample program from the scikit-learn website:
from sklearn import datasets
iris = datasets....
251
votes
11
answers
417k
views
Find p-value (significance) in scikit-learn LinearRegression
How can I find the p-value (significance) of each coefficient?
lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)
250
votes
9
answers
339k
views
pandas dataframe columns scaling with sklearn
I have a pandas dataframe with mixed type columns, and I'd like to apply sklearn's min_max_scaler to some of the columns. Ideally, I'd like to do these transformations in place, but haven't figured ...
249
votes
13
answers
245k
views
How to split data into 3 sets (train, validation and test)?
I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I ...
245
votes
9
answers
352k
views
A column-vector y was passed when a 1d array was expected
I need to fit RandomForestRegressor from sklearn.ensemble.
forest = ensemble.RandomForestRegressor(**RF_tuned_parameters)
model = forest.fit(train_fold, train_y)
yhat = model.predict(test_fold)
This ...
236
votes
11
answers
132k
views
Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
234
votes
20
answers
358k
views
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
Importing from pyxdameraulevenshtein gives the following error. I have
pyxdameraulevenshtein==1.5.3
pandas==1.1.4
scikit-learn==0.20.2.
Numpy is 1.16.1.
Works well in Python 3.6, but there is an issue in Python 3.7....
227
votes
16
answers
949k
views
ModuleNotFoundError: No module named 'sklearn'
I want to import sklearn but there is no module apparently:
ModuleNotFoundError: No module named 'sklearn'
I am using Anaconda and Python 3.6.1; I have checked everywhere but still can't find ...
211
votes
26
answers
172k
views
How to extract the decision rules from scikit-learn decision-tree?
Can I extract the underlying decision-rules (or 'decision paths') from a trained tree in a decision tree as a textual list?
Something like:
if A>0.4 then if B<0.2 then if C>0.8 then class='X'
210
votes
8
answers
297k
views
Random state (Pseudo-random number) in Scikit learn
I want to implement a machine learning algorithm in scikit learn, but I don't understand what the parameter random_state does. Why should I use it?
I also could not understand what is a Pseudo-...
198
votes
9
answers
152k
views
what is the difference between 'transform' and 'fit_transform' in sklearn
In the sklearn-python toolbox, there are two functions, transform and fit_transform, for sklearn.decomposition.RandomizedPCA. The descriptions of the two functions are as follows:
But what is the ...
180
votes
2
answers
207k
views
How does the class_weight parameter in scikit-learn work?
I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates.
The Situation
I want to use logistic regression to do binary classification ...
177
votes
11
answers
157k
views
RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility
I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, and reinstalling the latest versions all together again (using pip). I am still getting this ...
172
votes
6
answers
313k
views
Parameter "stratify" from method "train_test_split" (scikit Learn)
I am trying to use train_test_split from the scikit-learn package, but I am having trouble with the stratify parameter. Here is the code:
from sklearn import cross_validation, datasets
X = iris.data[:,:...
172
votes
30
answers
206k
views
How to convert a Scikit-learn dataset to a Pandas dataset
How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
print(type(data))
data1 = pd. # Is there a ...
166
votes
9
answers
303k
views
Can anyone explain me StandardScaler?
I am unable to understand the page of the StandardScaler in the documentation of sklearn.
Can anyone explain this to me in simple terms?
164
votes
3
answers
457k
views
How can I plot a confusion matrix? [duplicate]
I am using scikit-learn to classify text documents (22000) into 100 classes. I use scikit-learn's confusion matrix method for computing the confusion matrix.
model1 = LogisticRegression()
...
158
votes
2
answers
138k
views
Logistic regression python solvers' definitions
I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem.
Can someone briefly describe ...
158
votes
11
answers
224k
views
How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?
I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not ...
152
votes
4
answers
104k
views
What is exactly sklearn.pipeline.Pipeline?
I can't figure out how the sklearn.pipeline.Pipeline works exactly.
There are a few explanations in the doc. For example, what do they mean by:
Pipeline of transforms with a final estimator.
To ...
150
votes
7
answers
88k
views
How are feature_importances in RandomForestClassifier determined?
I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, ...
149
votes
5
answers
78k
views
What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?
I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I ...
145
votes
4
answers
319k
views
How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?
I'm working on a sentiment analysis problem; the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced since 1190 ...
145
votes
4
answers
111k
views
Sklearn, gridsearch: how to print out progress during the execution?
I am using GridSearch from sklearn to optimize parameters of the classifier. There is a lot of data, so the whole process of optimization takes a while: more than a day. I would like to watch the ...
144
votes
4
answers
78k
views
What are the different use cases of joblib versus pickle?
Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle.
it may be more interesting to use joblib’s replacement of pickle (joblib....
141
votes
10
answers
310k
views
UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples
I'm getting this weird error:
classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, ...
138
votes
10
answers
360k
views
how to check which version of nltk, scikit learn installed?
In a shell script, I am checking whether these packages are installed, and installing them if they are not. So within the shell script:
import nltk
echo nltk.__version__
but it stops shell script at ...
136
votes
6
answers
282k
views
Run an OLS regression with Pandas Data Frame
I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,...
135
votes
9
answers
310k
views
Stratified Train/Test-split in scikit-learn
I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below:
X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo) ...
134
votes
13
answers
358k
views
ImportError in importing from sklearn: cannot import name check_build
I am getting the following error while trying to import from sklearn:
>>> from sklearn import svm
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
...
132
votes
3
answers
40k
views
Why does one hot encoding improve machine learning performance? [closed]
I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to ...
130
votes
3
answers
375k
views
LogisticRegression: Unknown label type: 'continuous' using sklearn in python
I have the following code to test some of the most popular ML algorithms of the sklearn Python library:
import numpy as np
from sklearn import metrics, svm
from sklearn.linear_model ...
128
votes
21
answers
272k
views
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
My problem:
I have a dataset which is a large JSON file. I read it and store it in the trainList variable.
Next, I pre-process it - in order to be able to work with it.
Once I have done that I ...
128
votes
6
answers
110k
views
Understanding min_df and max_df in scikit CountVectorizer
I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency ...
128
votes
10
answers
396k
views
sklearn plot confusion matrix with labels
I want to plot a confusion matrix to visualize the classifier's performance, but it shows only the numbers of the labels, not the labels themselves:
from sklearn.metrics import confusion_matrix
import ...
125
votes
4
answers
268k
views
ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
I have a dataset consisting of both numeric and categorical data and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my ...
124
votes
3
answers
195k
views
Will scikit-learn utilize GPU?
Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm ...
116
votes
8
answers
170k
views
Passing categorical data to Sklearn Decision Tree
There are several posts about how to encode categorical data for Sklearn decision trees, but from the Sklearn documentation, we get this:
Some advantages of decision trees are:
(...)
Able to handle both ...
115
votes
6
answers
140k
views
scikit-learn .predict() default threshold
I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability.
In a binary classification problem, is scikit's classifier.predict() using 0....
115
votes
4
answers
101k
views
A progress bar for scikit-learn?
Is there any way to have a progress bar for the fit method in scikit-learn?
Is it possible to include a custom one with something like Pyprind?
114
votes
8
answers
338k
views
Accuracy Score ValueError: Can't Handle mix of binary and continuous target
I'm using linear_model.LinearRegression from scikit-learn as a predictive model. It works and it's perfect. I have a problem evaluating the predicted results using the accuracy_score metric.
This is ...
112
votes
10
answers
219k
views
sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()
Just trying to do a simple linear regression but I'm baffled by this error for:
regr = LinearRegression()
regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values)
which produces:
ValueError:...
110
votes
2
answers
305k
views
Converting list to numpy array
I have managed to load images in a folder using the command line sklearn: load_sample_images()
I would now like to convert it to a numpy.ndarray format with float32 datatype
I was able to convert it ...
106
votes
14
answers
90k
views
sklearn.LabelEncoder with never seen before values
If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
The only solution I could come up with for this is to map everything ...
105
votes
3
answers
42k
views
RandomForestClassifier vs ExtraTreesClassifier in scikit learn
Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit learn? I've spent a good bit of time reading the paper:
P. Geurts, D. Ernst, and L. Wehenkel, ...