|
I am using pandas and scikit-learn to explore machine learning. I am experimenting with reading historical information about ships and attempting to classify the type of ship. I am very confused about the proper format of the data. I have a data set in a pandas DataFrame that looks like the following:
My initial experiment is to see whether I can predict the vessel_type from a set of trivial features (length, width, flag_country). Question 1: What is the correct format of the data for passing to scikit-learn? I know I need a list of features, but I am not sure what the content should be. See below:
Or do I need to map flag_country to an integer value? If so, is there an easy way to do this in scikit-learn such that when I run metrics the country is displayed instead of the corresponding integer code? Likewise, I am not sure how to represent the labels. Should I have only one value for each feature row, or do I need all labels for every feature row, with all but the single label associated with the row set to 0?
-or-
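To make the two label representations concrete, here is a minimal sketch using pandas only. The rows, the values, and the variable names are invented for illustration and are not the real data set; only the column names come from the description above.

```python
import pandas as pd

# Hypothetical rows, invented purely for illustration (not the real data set).
df = pd.DataFrame({
    "length": [300.0, 120.5, 85.0],
    "width": [40.0, 20.0, 12.5],
    "flag_country": ["PA", "LR", "PA"],
    "vessel_type": ["tanker", "cargo", "tug"],
})

# Representation A: one integer code per row.
# `names` holds the original strings, so names[codes] recovers them.
codes, names = pd.factorize(df["vessel_type"])

# Representation B: one 0/1 column per label, with a single 1 in each row.
one_hot = pd.get_dummies(df["vessel_type"]).astype(int)

# The same integer mapping applied to a categorical feature.
flag_codes, flag_names = pd.factorize(df["flag_country"])
```

In representation A each feature row carries exactly one label value, while in representation B every row carries all labels with only the matching one set to 1. Keeping `names` and `flag_names` around is what allows integer codes to be translated back to readable strings when reporting results.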
Question 2: In addition to the content of the files, I am having a hard time getting the pandas DataFrame type to be accepted by scikit-learn. This is new. In the past I did some text-based classification using pandas and scikit-learn, and everything just worked like magic. Now, whenever I try to pass pandas DataFrame objects to scikit-learn, I get errors like ValueError: Array contains NaN or infinity. I went back to my previous experiments, tried to run the same code that worked about 9 months ago, and got the same error. I have upgraded both pandas and scikit-learn since the successful experiment, but I did not expect to encounter such significant incompatibilities. Below is my basic attempt at passing data to scikit-learn. Note that I have tried several explicit conversions to np.array and using dataframe.values, but nothing seems to work.
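One way to check whether the feature columns really contain the values named in that error is sketched below. The data is invented, and the names `X`, `finite_rows`, and `X_clean` are hypothetical; it assumes the features are numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing and one infinite value (invented data).
df = pd.DataFrame({
    "length": [300.0, np.nan, 85.0],
    "width": [40.0, 20.0, np.inf],
})

# Explicit conversion to a float array, as would be passed to an estimator.
X = df[["length", "width"]].to_numpy(dtype=float)

# Count the offending values per column.
nan_counts = df.isna().sum()
inf_counts = np.isinf(X).sum(axis=0)

# Keep only the rows in which every feature is finite.
finite_rows = np.isfinite(X).all(axis=1)
X_clean = X[finite_rows]
```

If either count is nonzero, the error reflects the data itself rather than the DataFrame-to-array conversion. Dropping the affected rows is only one possible response; whether it is appropriate depends on how many rows are lost.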
Sorry for the long post, but I am pretty much stuck after spending over 10 hours trying to get the first experiment to work. Any help or advice would be greatly appreciated. |