Saturday, 30 July 2016

How to deploy Data Science projects?

In today's digital age, data science has become a top skill, and data scientist has been called the sexiest job of the 21st century.

Data science projects do not have a clean life cycle with well-defined steps like the software development life cycle (SDLC). Instead, they are non-linear and highly iterative, cycling between the data science team and various other teams in an organization.

SAS Institute, a leader in analytics, developed its own method for data mining called SEMMA (Sample, Explore, Modify, Model and Assess).

However, many companies have adopted a standard data science workflow called CRISP-DM (Cross-Industry Standard Process for Data Mining). It was developed by a consortium of companies including SPSS, Teradata, Daimler and NCR Corporation in 1997.

Whichever method is used, the process is similar and involves the following steps:
  • Business Understanding: This is the first and most basic step, because a data scientist cannot move forward without understanding the business problem.
  • Data Acquisition: Based on the business problem, the next step is to understand and acquire the data that is needed. Identify the sources where it is available and who is responsible for providing it. Data can come from various sources, such as customer data, demographic data, third-party data, weblogs, social media data, and streaming data like sensor, audio or video data. The main challenge is to decide whether the data is up to date and clean enough for model consumption. With the Internet of Things in full swing, data acquisition into a Big Data platform is an important step.
  • Data Preparation: This is also called the data wrangling phase, and it takes almost 60% of overall project time. Collected data has to be formatted, treated for missing values, checked for abnormalities or seasonality, and made ready for model consumption.
  • Modelling: This is the core activity of a data science project. It requires writing, running and refining programs to analyse the data and derive meaningful business insights from it. Open source tools like R and Python and commercial tools like SAS and IBM SPSS are often used to create the statistical models. Various machine learning techniques are applied to the data, depending on the business problem.
  • Evaluation: There are several methods for comparing the developed models so that the best one can be chosen for deployment. Typical comparison methods are AUC (area under the curve), the confusion matrix, gain/lift charts, and Root Mean Squared Error.
  • Deployment: Once the most suitable model has been identified in the evaluation step, it is tested further with live data and then deployed into the production environment.
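
To make the evaluation step more concrete, here is a minimal sketch in plain Python of three of the comparison methods named in the list: the confusion matrix, AUC and Root Mean Squared Error. The function names, the toy labels and the two sets of model scores are all made up for illustration. In a real project these metrics would more likely come from a library such as scikit-learn or from the R, SAS or SPSS tooling mentioned earlier.

```python
import math

def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs where the positive
    example gets the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root Mean Squared Error for numeric predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data: true labels and scores from two candidate models.
labels = [1, 0, 1, 1, 0, 0]
model_a_scores = [0.9, 0.2, 0.8, 0.6, 0.4, 0.1]
model_b_scores = [0.6, 0.7, 0.8, 0.3, 0.4, 0.2]

auc_a = auc(labels, model_a_scores)
auc_b = auc(labels, model_b_scores)

# Pick the candidate with the higher AUC for deployment.
best_model = "model_a" if auc_a >= auc_b else "model_b"

# Confusion matrix for model B at an assumed 0.5 decision threshold.
model_b_pred = [1 if s >= 0.5 else 0 for s in model_b_scores]
cm_b = confusion_matrix(labels, model_b_pred)
```

AUC suits classifiers that output scores, while RMSE applies to models that predict a numeric value, so in practice only the metrics that match the business problem would be compared.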

There are further steps as well, such as monitoring the live model's performance and watching for any degradation. New models are then developed and compared against the live model.
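
The monitoring loop can be sketched roughly as follows, assuming a single metric such as AUC is tracked over time. The function names, the thresholds and the weekly figures are hypothetical and chosen only for illustration; a real setup would pick them to fit the business problem.

```python
def is_degraded(baseline_metric, live_metric, tolerance=0.05):
    """True when the live metric has dropped more than `tolerance`
    below the value measured at deployment time."""
    return (baseline_metric - live_metric) > tolerance

def choose_model(live_metric, new_metric, min_gain=0.01):
    """Keep the live model unless the new model beats it by at least
    `min_gain` on the same evaluation data."""
    return "new" if new_metric - live_metric >= min_gain else "live"

# Hypothetical numbers: AUC at deployment and AUC measured in later weeks.
baseline_auc = 0.85
weekly_live_auc = [0.84, 0.83, 0.78]

# Flag each week in which the live model has degraded.
alerts = [is_degraded(baseline_auc, m) for m in weekly_live_auc]

# Once degradation is flagged, compare a newly developed model
# against the live one and decide which to run.
decision = choose_model(live_metric=0.78, new_metric=0.82)
```

Requiring a minimum gain before replacing the live model is one possible design choice. It avoids swapping models over differences that are too small to matter.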

Data science has evolved beyond conventional predictive modeling into recommendation engines, text mining, deep learning and artificial intelligence. The foundation remains the same: gathering data, cleaning it, and then applying various algorithms.

