Cab fare Predictor
# Import libraries and data

    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O

    # Load the test dataset
    test = pd.read_csv('../input/test.csv')

    # Set lightweight dtypes in a dictionary to speed up loading the dataset
    # (float64 is overkill for GPS coordinates)
    types = {'fare_amount':
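
As a minimal sketch of the idea above: passing a `dtype` mapping to `pd.read_csv` lets pandas store the coordinates as `float32` instead of the default `float64`, and `nrows` limits how much of the file is read. The inline CSV, the column list and the exact dtype choices below are assumptions for illustration, not the notebook's original dictionary.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for the real training file (illustrative data).
csv_text = """fare_amount,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
4.5,-73.844311,40.721319,-73.841610,40.712278,1
16.9,-74.016048,40.711303,-73.979268,40.782004,1
5.7,-73.982738,40.761270,-73.991242,40.750562,2
"""

# Hypothetical dtype mapping: float32 keeps about 7 significant digits,
# which is roughly metre-level precision for coordinates in this range.
types = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}

# nrows reads only the first rows, which is how a slice of a large file can be loaded.
sample = pd.read_csv(io.StringIO(csv_text), dtype=types, nrows=2)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```

Halving the width of every float column roughly halves the memory the frame needs, which matters when loading millions of rows.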

    # Load part of the dataset with the specified dtypes
    train = pd.

# Data Exploration and Cleanup

Now let's see what the files look like and some general statistics about them.

    train.head()

    train.describe()

Let's plot the distribution of the values to get to know the data better and see how it is spread. Feel free to look at the distributions of the other columns as well.

From a quick look at the data, we can see that it is not 100% clean, and removing some records should improve it.

Since there are more than enough records, we can simply drop all rows with null values. Keep in mind that you may not see any nulls depending on the size of the slice of training data you load, but there are some in the 6 million rows.

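
A small sketch of that cleanup step, using a made-up three-row frame rather than the real training data: count the nulls per column first, then drop the affected rows.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one record missing its drop-off coordinates.
df = pd.DataFrame({
    "fare_amount": [4.5, 16.9, 5.7],
    "dropoff_longitude": [-73.84, np.nan, -73.99],
    "dropoff_latitude": [40.71, np.nan, 40.75],
})

print(df.isnull().sum())  # number of nulls in each column
cleaned = df.dropna()     # new frame without any row that contains a null
print(len(df), "->", len(cleaned))
```
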

    # Drop nulls if any exist
    train.dropna(inplace=True)

Now we do not see any obvious inconsistencies in the data.

    train.describe()

# Feature Engineering

We only have 8 directly usable columns, but we can get much more information out of them by engineering features from individual columns or combinations of them.

First, we can compute the distance between the pick-up and drop-off points, which should be a strong predictor of the fare.

    # Define a function for calculating the distance in km from coordinates
    def dist_calc(df):
        for i, row in df.
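
One common way to get a distance in km from two latitude/longitude pairs is the haversine (great-circle) formula. The sketch below assumes that approach; the function name `haversine_km` and the example coordinates are illustrative and are not taken from the notebook's `dist_calc`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates: midtown Manhattan to the JFK airport area.
print(round(haversine_km(40.7580, -73.9855, 40.6413, -73.7781), 1), "km")
```

This is the straight-line distance over the Earth's surface, so it underestimates the distance actually driven, but it is cheap to compute and usually tracks the fare well.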

Next, looking at the pickup_datetime column, we can extract some date-related features from it.

    # Convert the date column for feature engineering
    train['pickup_datetime'] = train['pickup_datetime'].
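
A minimal sketch of this kind of date handling, assuming the raw values are strings shaped like `YYYY-MM-DD HH:MM:SS UTC` (the sample rows and the format string are assumptions): parse the column once with `pd.to_datetime`, then read integer features through the `.dt` accessor.

```python
import pandas as pd

# Illustrative timestamps; the real column may be formatted differently.
df = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"],
})

# An explicit format avoids per-row format guessing; utc=True marks the result as UTC.
df["pickup_datetime"] = pd.to_datetime(
    df["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)

# Integer features via the .dt accessor.
df["hour"] = df["pickup_datetime"].dt.hour
df["weekday"] = df["pickup_datetime"].dt.weekday  # Monday=0 ... Sunday=6
df["month"] = df["pickup_datetime"].dt.month
df["year"] = df["pickup_datetime"].dt.year
print(df)
```
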

    # Extract integer features from pickup_datetime
    train["hour"] = train.pickup_datetime.dt.hour
    train["weekday"] = train.pickup_datetime.dt.weekday
    train["month"] = train.

Last but not least, we can look at how the fare correlates with the distance from certain hotspots around New York, where the fare will be higher or lower than normal.

    # Function for calculating the distance between coordinates passed as separate variables
    def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):

    # Function to calculate the distance to the new hotspot locations

    train.head()

    test.head()

We can visualize the relationship between all features with a heat map.
The further a value is from 0, the more influence that feature has on the fare prediction.

    # Plot a heat map of the feature correlations
    plt.

# Model Training

Since the coordinate columns seem to be directly correlated with fare_amount, I chose to keep them, together with all the newly generated features, for fitting the models and making predictions. Now we just have to drop the non-predictive columns and split the data into training and test sets for training the model.

    X = train.

    X.head()

    y.head()

    # Split the data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.

    # Drop the columns from the test dataset that we will not use
    test_pred = test.
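
The split step can be sketched as follows with scikit-learn's `train_test_split`. The random arrays stand in for the real `X` and `y`, and `test_size=0.2` and `random_state=42` are assumed values, since the original arguments are not fully visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the feature matrix and the fare target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# The function returns four pieces in this fixed order: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Fixing `random_state` makes the split reproducible, so scores from different models are compared on the same held-out rows.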

    # Scale the values, if necessary
    #scaler = RobustScaler()
    #X_train_scaled = scaler.

## Linear Regression

Let's start with a simple linear regression and see how well it predicts the fare amount.

    # Initialize a linear regression, fit the data and get scores
    lm = LinearRegression()
    lm.

    # Predict fares and get a score for them
    y_pred = lm.
    predict(test_pred)
    LinearPredictions = np.round(LinearPredictions, decimals=2)
    LinearPredictions

    # Check that the predictions have the right dimensions
    LinearPredictions.size

    # Put the predictions into a submission dataframe
    linear_submission = pd.
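
The linear regression workflow in this section can be sketched end to end on synthetic data. The single distance feature, the fare formula and the use of RMSE as the error measure are all assumptions made for illustration. The notebook's own features and scoring may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: fare grows roughly linearly with distance (illustrative only).
rng = np.random.default_rng(0)
distance_km = rng.uniform(0.5, 20.0, size=500)
fare = 2.5 + 1.6 * distance_km + rng.normal(scale=1.0, size=500)

X = distance_km.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, fare, test_size=0.2, random_state=0)

lm = LinearRegression()
lm.fit(X_train, y_train)

y_pred = lm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same units as the fare
print("R^2 on held-out data:", lm.score(X_test, y_test))
print("RMSE on held-out data:", rmse)
```
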

## Gradient Boosting

Linear regression seemed to work reasonably well, but we can get much more accurate predictions with a more finely tuned model, such as gradient boosting.


    # Train and tune the model, generating predictions
    xgbm = XGBoost(X_train, X_test, y_train, y_test)
    XGBPredictions = xgbm.predict(xgb.

    # Check whether the predictions look realistic
    XGBPredictions
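
To illustrate why a boosted tree ensemble can beat a linear model, here is a self-contained sketch that swaps in scikit-learn's `GradientBoostingRegressor` for the XGBoost library used in this notebook. The synthetic non-linear target, the `rmse` helper and the hyperparameters are all assumptions, and this is not the notebook's `XGBoost` function.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic target with a non-linear term, so a tree ensemble has something to gain.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = 3.0 + 2.0 * X[:, 0] + 5.0 * np.sin(X[:, 1]) + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def rmse(model):
    """Fit on the training split and return the RMSE on the held-out split."""
    model.fit(X_train, y_train)
    return np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

linear_rmse = rmse(LinearRegression())
boosted_rmse = rmse(
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
)
print("linear RMSE:", linear_rmse)
print("boosted RMSE:", boosted_rmse)
```

The linear model cannot represent the sine term, while the boosted trees approximate it piece by piece, which is the kind of gain the text describes.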

    # Round the predictions to 2 decimal places
    XGBPredictions = np.round(XGBPredictions, decimals=2)
    XGBPredictions

# Submission

Since XGB did much better than linear regression, we prepare the submission file by formatting it and matching each predicted fare with its corresponding key.


    XGBPredictions}, columns=['key', 'fare_amount'])
    XGB_submission.head()

    #submission = linear_submission
    submission = XGB_submission

    # Generate the final submission CSV file
    submission.
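
Pairing each key with its predicted fare and writing the result as CSV can be sketched as below. The keys and fare values are placeholders, and the CSV is written to a string here so the example has no side effects. Passing a file path to `to_csv` would write a file instead.

```python
import numpy as np
import pandas as pd

# Placeholder keys and predictions (illustrative values only).
keys = ["id_1", "id_2"]
predictions = np.round(np.array([10.1234, 8.5678]), decimals=2)

# One row per test record: the key column next to its predicted fare.
submission = pd.DataFrame(
    {"key": keys, "fare_amount": predictions}, columns=["key", "fare_amount"]
)

# index=False keeps the DataFrame's row index out of the output.
csv_text = submission.to_csv(index=False)
print(csv_text)
```
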

That's it for this short tutorial on exploring, analyzing and making predictions from a basic dataset.