I am using pandas and scikit to explore machine learning. I am experimenting with reading historical information about ships and attempting to classify the type of ship. I am very confused about the proper format of the data.

I have a data set that looks like the following in a pandas dataframe:

     mmsi vessel_type        vessel_name  length  width  flag_country                  
412417097     Unknown     ZHELINGYU52052      36      7         China       
412422090       Cargo        JIE BANG 17      50      8         China    
413975458       Other  GUIPINGNANHUO2626     308     14         China      
412456888       Other  ZHE CANG YU 72777     194     14         China       
413808364     Unknown    WAN SHUN FA1818      55     10         China       
215380000       Cargo          GIORGOS B     186     30         Malta

My initial experiment is to see if I can predict the vessel_type using a set of trivial features (length, width, flag_country).

Question 1

What is the correct format of the data for passing to scikit? I know I need a list of features, but I am not sure what the content should be. See below:

Features
[36,7,China]
[50,8,China]
[308,14,China]

Or do I need to map flag_country to an integer value? If so, is there an easy way to do this in scikit such that when I run metrics the country is actually displayed instead of the corresponding integer code?
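To make the question concrete, here is a minimal sketch of one possible mapping, using pandas get_dummies to turn flag_country into one 0/1 column per country. The sample rows are copied from the table above; the variable names (df, features, feature_names, X) are only for illustration, and this is one option rather than necessarily the recommended one.

```python
import pandas as pd

# Illustrative sample mirroring a few rows of the table above.
df = pd.DataFrame({
    "length": [36, 50, 308, 186],
    "width": [7, 8, 14, 30],
    "flag_country": ["China", "China", "China", "Malta"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
# Each country becomes its own column named "flag_country_<name>".
features = pd.get_dummies(df, columns=["flag_country"])

# Keep the column names so results can be reported by country name
# rather than by an opaque integer code.
feature_names = list(features.columns)

# Plain numeric matrix of shape (n_rows, n_features) for scikit.
X = features.to_numpy(dtype=float)

print(feature_names)
print(X)
```

Because each country keeps its name in the column header, feature_names can be used later to label coefficients or importances.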

Likewise, I am not sure how to represent the labels. Should I have only one value for each feature row, or do I need all labels for every feature row, with all but the single label associated with the row set to 0?

Labels
['Unknown']
['Cargo']
['Other']

-or-

Unknown Cargo  Other
   1      0      0
   0      1      0
   0      0      1
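As a sketch of the first option (one label per row), scikit classifiers accept a one-dimensional array of string labels directly and report the class names in classes_. The feature values below are the length and width columns from the table above; LabelEncoder is shown only to illustrate how an explicit integer mapping could be made and reversed, and max_iter=1000 is an arbitrary choice for this tiny example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# length and width for the six rows in the table above
X = np.array(
    [[36, 7], [50, 8], [308, 14], [194, 14], [55, 10], [186, 30]],
    dtype=float,
)
# One label per row, as a flat array of strings.
labels = np.array(["Unknown", "Cargo", "Other", "Other", "Unknown", "Cargo"])

# The classifier handles the string labels itself.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
predicted = clf.predict(X)  # predictions come back as strings as well

# Optional explicit mapping to integers and back again.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)      # classes are sorted alphabetically
names = encoder.inverse_transform(y)   # recovers the original strings

print(clf.classes_)
print(y)
```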

Question 2

In addition to the format of the data, I am having a hard time getting pandas DataFrame objects accepted by scikit. This is new. In the past I did some text-based classification using pandas and scikit and everything just worked like magic. Now, whenever I try to pass pandas DataFrame objects to scikit, I get errors like ValueError: Array contains NaN or infinity. I went back to my previous experiments and tried to run the same code that worked about 9 months ago, and I get the same error. I have upgraded both pandas and scikit since the successful experiment, but did not expect to encounter such significant incompatibilities. Below is my basic attempt at passing data to scikit. Note that I have tried several explicit conversions to NumPy arrays (np.array, dataframe.values), but nothing seems to work.

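from sklearn.linear_model import LogisticRegression

labels = df_train["vessel_type"]   # target column
features = df_train[featurelist]   # featurelist: list of feature column names
clf = LogisticRegression()
clf.fit(features, labels)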
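A minimal sketch of how the data could be checked for missing values before fitting, since the error message mentions NaN. The small df_train below is made up for illustration, with one length deliberately set to NaN; whether the real data actually contains missing values is an assumption that this check would confirm or rule out. Dropping the affected rows is only one possible way to handle them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for df_train; one length is deliberately missing.
df_train = pd.DataFrame({
    "vessel_type": ["Unknown", "Cargo", "Other", "Other", "Unknown", "Cargo"],
    "length": [36, 50, 308, 194, np.nan, 186],
    "width": [7, 8, 14, 14, 10, 30],
})
featurelist = ["length", "width"]

# Count missing values per feature column.
nan_counts = df_train[featurelist].isnull().sum()
print(nan_counts)

# Drop rows with a missing feature or a missing label.
clean = df_train.dropna(subset=featurelist + ["vessel_type"])

features = clean[featurelist].to_numpy(dtype=float)
labels = clean["vessel_type"].to_numpy()

# Confirm nothing non-finite remains before fitting.
assert np.isfinite(features).all()

clf = LogisticRegression(max_iter=1000)
clf.fit(features, labels)
```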

Sorry for the long post, but I am pretty much stuck on this after having spent over 10 hours trying to get the first experiment to work. Any help or advice would be greatly appreciated.

asked Feb 16 '14 at 12:43

david

edited Feb 16 '14 at 12:45

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.