Questions tagged [scikit-learn]

Scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining, with a focus on machine learning. It is accessible to everybody and reusable in various contexts. It is built on NumPy and SciPy. The project is open source and commercially usable (BSD license).

324 votes
25 answers
390k views

Label encoding across multiple columns in scikit-learn

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'...
Bryan
  • 6,129
317 votes
16 answers
1.1m views

How to normalize a numpy array to a unit vector

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function: def normalize(v): norm = np.linalg.norm(v) if ...
Donbeo
  • 17.4k
272 votes
14 answers
538k views

Is there a library function for Root mean square error (RMSE) in python?

I know I could implement a root mean squared error function like this: def rmse(predictions, targets): return np.sqrt(((predictions - targets) ** 2).mean()) What I'm looking for if this rmse ...
siamii
  • 23.9k
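
For reference, the one-line definition quoted in the excerpt above can be written out as a small runnable sketch. It assumes NumPy is installed and that both inputs have the same shape; the conversion to float arrays is an addition for illustration and is not part of the quoted code.

```python
import numpy as np


def rmse(predictions, targets):
    """Root mean squared error between two equal-shaped sequences."""
    # Converting to float arrays (an assumption added here) lets plain
    # Python lists be passed in as well as NumPy arrays.
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Square the element-wise errors, average them, then take the root.
    return float(np.sqrt(((predictions - targets) ** 2).mean()))


# Errors of 2 and 4 give sqrt((4 + 16) / 2) = sqrt(10).
example = rmse([2.0, 4.0], [0.0, 0.0])
```

Whether a library already ships an equivalent function is what the question itself asks, so this sketch only restates the hand-written version.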
267 votes
27 answers
828k views

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

I am using sklearn and having a problem with the affinity propagation. I have built an input matrix and I keep getting the following error. ValueError: Input contains NaN, infinity or a value too ...
Ethan Waldie
  • 2,869
264 votes
15 answers
497k views

ImportError: No module named sklearn.cross_validation

I am using python 2.7 in Ubuntu 14.04. I installed scikit-learn, numpy and matplotlib with these commands: sudo apt-get install build-essential python-dev python-numpy \ python-numpy-dev python-...
arthurckl
  • 5,371
258 votes
7 answers
163k views

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data? I have the following sample program from the scikit-learn website: from sklearn import datasets iris = datasets....
garak
  • 4,763
251 votes
11 answers
417k views

Find p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient? lm = sklearn.linear_model.LinearRegression() lm.fit(x,y)
elplatt
  • 3,337
250 votes
9 answers
339k views

pandas dataframe columns scaling with sklearn

I have a pandas dataframe with mixed type columns, and I'd like to apply sklearn's min_max_scaler to some of the columns. Ideally, I'd like to do these transformations in place, but haven't figured ...
flyingmeatball
249 votes
13 answers
245k views

How to split data into 3 sets (train, validation and test)?

I have a pandas dataframe and I wish to divide it into 3 separate sets. I know that using train_test_split from sklearn.cross_validation, one can divide the data into two sets (train and test). However, I ...
CentAu
  • 11k
245 votes
9 answers
352k views

A column-vector y was passed when a 1d array was expected

I need to fit RandomForestRegressor from sklearn.ensemble. forest = ensemble.RandomForestRegressor(**RF_tuned_parameters) model = forest.fit(train_fold, train_y) yhat = model.predict(test_fold) This ...
Klausos Klausos
236 votes
11 answers
132k views

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
bmasc
  • 2,490
234 votes
20 answers
358k views

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Importing from pyxdameraulevenshtein gives the following error. I have pyxdameraulevenshtein==1.5.3, pandas==1.1.4, scikit-learn==0.20.2, and NumPy 1.16.1. It works well in Python 3.6; the issue is in Python 3.7....
Sachit Jani
  • 2,441
227 votes
16 answers
949k views

ModuleNotFoundError: No module named 'sklearn'

I want to import sklearn but there is no module apparently: ModuleNotFoundError: No module named 'sklearn' I am using Anaconda and Python 3.6.1; I have checked everywhere but still can't find ...
Hareez Rana
  • 2,283
211 votes
26 answers
172k views

How to extract the decision rules from scikit-learn decision-tree?

Can I extract the underlying decision rules (or 'decision paths') from a trained decision tree as a textual list? Something like: if A>0.4 then if B<0.2 then if C>0.8 then class='X'
Dror Hilman
  • 7,347
210 votes
8 answers
297k views

Random state (Pseudo-random number) in Scikit learn

I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the random_state parameter does. Why should I use it? I also could not understand what a Pseudo-...
Elizabeth Susan Joseph
198 votes
9 answers
152k views

what is the difference between 'transform' and 'fit_transform' in sklearn

In the sklearn-python toolbox, there are two functions, transform and fit_transform, for sklearn.decomposition.RandomizedPCA. The descriptions of the two functions are as follows. But what is the ...
tqjustc
  • 3,764
180 votes
2 answers
207k views

How does the class_weight parameter in scikit-learn work?

I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates. The Situation I want to use logistic regression to do binary classification ...
kilgoretrout
  • 3,627
177 votes
11 answers
157k views

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, then reinstalling the latest versions all together again (using pip). I am still getting this ...
Blue482
  • 3,016
172 votes
6 answers
313k views

Parameter "stratify" from method "train_test_split" (scikit Learn)

I am trying to use train_test_split from the scikit-learn package, but I am having trouble with the stratify parameter. Here is the code: from sklearn import cross_validation, datasets X = iris.data[:,:...
Daneel Olivaw
172 votes
30 answers
206k views

How to convert a Scikit-learn dataset to a Pandas dataset

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame? from sklearn.datasets import load_iris import pandas as pd data = load_iris() print(type(data)) data1 = pd. # Is there a ...
SANBI samples
166 votes
9 answers
303k views

Can anyone explain me StandardScaler?

I am unable to understand the page of the StandardScaler in the documentation of sklearn. Can anyone explain this to me in simple terms?
nitinvijay23
  • 1,831
164 votes
3 answers
457k views

How can I plot a confusion matrix? [duplicate]

I am using scikit-learn to classify text documents (22,000) into 100 classes. I use scikit-learn's confusion matrix method for computing the confusion matrix. model1 = LogisticRegression() ...
minks
  • 2,979
158 votes
2 answers
138k views

Logistic regression python solvers' definitions

I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem. Can someone briefly describe ...
Clement
  • 1,730
158 votes
11 answers
224k views

How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?

I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not ...
Louic
  • 2,553
152 votes
4 answers
104k views

What is exactly sklearn.pipeline.Pipeline?

I can't figure out how sklearn.pipeline.Pipeline works exactly. There are a few explanations in the docs. For example, what do they mean by: Pipeline of transforms with a final estimator. To ...
farhawa
  • 10.3k
150 votes
7 answers
88k views

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time-series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result I would like to find out, ...
user2244670
  • 1,501
149 votes
5 answers
78k views

What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?

I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I ...
O.rka
  • 30.5k
145 votes
4 answers
319k views

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem. The data looks like this (label: instances): 5: 1190, 4: 838, 3: 239, 1: 204, 2: 127. So my data is unbalanced since 1190 ...
new_with_python
145 votes
4 answers
111k views

Sklearn, gridsearch: how to print out progress during the execution?

I am using GridSearch from sklearn to optimize parameters of the classifier. There is a lot of data, so the whole process of optimization takes a while: more than a day. I would like to watch the ...
doubts
  • 1,822
144 votes
4 answers
78k views

What are the different use cases of joblib versus pickle?

Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle. it may be more interesting to use joblib’s replacement of pickle (joblib....
msunbot
  • 1,951
141 votes
10 answers
310k views

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples

I'm getting this weird error: classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, ...
Sticky
  • 3,859
138 votes
10 answers
360k views

how to check which version of nltk, scikit learn installed?

In a shell script I am checking whether these packages are installed, and if they are not, installing them. So within the shell script: import nltk echo nltk.__version__ but it stops the shell script at ...
nlper
  • 2,377
136 votes
6 answers
282k views

Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to be able to predict the values of column A from the values in columns B and C. Here is a toy example: import pandas as pd df = pd.DataFrame({"A": [10,20,30,...
Michael
  • 13.7k
135 votes
9 answers
310k views

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below: X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo) ...
pir
  • 5,785
134 votes
13 answers
358k views

ImportError in importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn: >>> from sklearn import svm Traceback (most recent call last): File "<pyshell#17>", line 1, in <module> ...
ayush singhal
132 votes
3 answers
40k views

Why does one hot encoding improve machine learning performance? [closed]

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to ...
maheshakya
  • 2,208
130 votes
3 answers
375k views

LogisticRegression: Unknown label type: 'continuous' using sklearn in python

I have the following code to test some of the most popular ML algorithms of the sklearn Python library: import numpy as np from sklearn import metrics, svm from sklearn.linear_model ...
mllamazares
  • 8,006
128 votes
21 answers
272k views

Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative

My problem: I have a dataset which is a large JSON file. I read it and store it in the trainList variable. Next, I pre-process it - in order to be able to work with it. Once I have done that I ...
Euskalduna
  • 1,607
128 votes
6 answers
110k views

Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency ...
moeabdol
  • 4,979
128 votes
10 answers
396k views

sklearn plot confusion matrix with labels

I want to plot a confusion matrix to visualize the classifier's performance, but it shows only the numbers of the labels, not the labels themselves: from sklearn.metrics import confusion_matrix import ...
hmghaly
  • 1,452
125 votes
4 answers
268k views

ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

I have a dataset consisting of both numeric and categorical data and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my ...
sums22
  • 1,983
124 votes
3 answers
195k views

Will scikit-learn utilize GPU?

Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm ...
blue-sky
  • 53.3k
116 votes
8 answers
170k views

Passing categorical data to Sklearn Decision Tree

There are several posts about how to encode categorical data to Sklearn Decision trees, but from Sklearn documentation, we got these Some advantages of decision trees are: (...) Able to handle both ...
0xhfff
  • 1,275
115 votes
6 answers
140k views

scikit-learn .predict() default threshold

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability. In a binary classification problem, is scikit's classifier.predict() using 0....
ADJ
  • 5,112
115 votes
4 answers
101k views

A progress bar for scikit-learn?

Is there any way to have a progress bar for the fit method in scikit-learn? Is it possible to include a custom one with something like Pyprind?
114 votes
8 answers
338k views

Accuracy Score ValueError: Can't Handle mix of binary and continuous target

I'm using linear_model.LinearRegression from scikit-learn as a predictive model. It works and it's perfect. I have a problem evaluating the predicted results using the accuracy_score metric. This is ...
Arij SEDIRI
  • 2,118
112 votes
10 answers
219k views

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Just trying to do a simple linear regression but I'm baffled by this error for: regr = LinearRegression() regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values) which produces: ValueError:...
sunny
  • 3,861
110 votes
2 answers
305k views

Converting list to numpy array

I have managed to load images in a folder using the command line sklearn: load_sample_images(). I would now like to convert it to a numpy.ndarray format with float32 datatype. I was able to convert it ...
Priya Narayanan
106 votes
14 answers
90k views

sklearn.LabelEncoder with never seen before values

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set. The only solution I could come up with for this is to map everything ...
cjauvin
  • 3,623
105 votes
3 answers
42k views

RandomForestClassifier vs ExtraTreesClassifier in scikit learn

Can anyone explain the difference between the RandomForestClassifier and ExtraTreesClassifier in scikit-learn? I've spent a good bit of time reading the paper: P. Geurts, D. Ernst., and L. Wehenkel, ...
denson
  • 2,416
