In today’s digital transformation,
big data gives organizations an edge: they can analyze customer behavior and
hyper-personalize every interaction, which results in more cross-selling, a better
customer experience and, ultimately, more revenue.
The market for big data has grown
steadily as more and more enterprises have implemented a data-driven
strategy. While Apache Hadoop is the most well-established tool for analyzing
big data, there are thousands of big data tools out there, all of them
promising to save you time and money and to help you uncover never-before-seen
business insights.
I have selected a few to get you
going.
Avro: A data serialization system developed by Doug
Cutting. It encodes data together with its schema and is widely used for Hadoop
files.
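To make the idea of a schema concrete, here is a minimal sketch in Python. Avro schemas are written in JSON; the PageView record and its fields are invented for illustration, and the conforms helper is a hand-rolled stand-in for the kind of check an Avro library performs when it serializes a record. It is not the Avro API.

```python
import json

# An Avro-style record schema. Avro schemas are written in JSON;
# the record name and fields here are hypothetical.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"}
  ]
}
"""

# Map the two Avro primitive types used above to Python types.
PYTHON_TYPES = {"long": int, "string": str}


def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types.

    Illustrative helper only, not part of any Avro library.
    """
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], PYTHON_TYPES[t]) for name, t in fields.items())


schema = json.loads(SCHEMA_JSON)
good = {"user_id": 42, "url": "/home"}
bad = {"user_id": "42", "url": "/home"}  # user_id is a string, not a long

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False
```

Because the schema travels with the data, any reader can tell what each field means without consulting the program that wrote it.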
Cassandra: A distributed,
open source database designed to handle large amounts of data
across commodity servers while providing a highly available service. It is a
NoSQL solution that was initially developed at Facebook and is used by many
organizations, such as Netflix, Cisco and Twitter.
Drill: An open source distributed
system for performing interactive analysis on large-scale datasets. It is
similar to Google’s Dremel and is maintained by the Apache Software Foundation.
Elasticsearch: An open source
search engine built on Apache Lucene. It is written in Java and can power
extremely fast searches that support your data discovery applications.
Flume: A framework for
populating Hadoop with data from web servers, application servers and mobile
devices. It is the plumbing between sources and Hadoop.
HCatalog: A centralized metadata
management and sharing service for Apache Hadoop. It provides a unified view
of all data in Hadoop clusters and allows diverse tools, including Pig and
Hive, to process any data element without needing to know where in
the cluster the data is physically stored.
Impala: Provides fast, interactive
SQL queries directly on your Apache Hadoop data stored in HDFS or HBase using
the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue
Beeswax) as Apache Hive. This provides a familiar and unified platform for
batch-oriented or real-time queries.
JSON: Many of today’s NoSQL
databases store data in JSON (JavaScript Object Notation), a format that has
become popular with web developers.
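As a quick illustration, here is how a nested record round-trips through JSON using Python's standard json module. The customer document is made up for the example; a real database would define its own fields.

```python
import json

# A hypothetical customer document, as a document database might store it.
customer = {
    "name": "Ada",
    "orders": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}],
    "newsletter": True,
}

text = json.dumps(customer, sort_keys=True)  # Python dict -> JSON text
restored = json.loads(text)                  # JSON text -> Python dict

print(text)
print(restored == customer)  # True
```

Nested lists and objects survive the round trip unchanged, which is part of why the format suits document-style data.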
Kafka: A distributed
publish-subscribe messaging system capable of handling
all the data flow activity of a consumer website and processing that data. This
type of data (page views, searches and other user actions) is a key
ingredient of the current social web.
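The publish-subscribe idea can be sketched with a toy in-memory broker. This is not the Kafka API: the ToyBroker class, the topic name and the consumer names are all assumptions made for illustration. What it does borrow is the general shape, an append-only log per topic with each consumer tracking its own read position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (not the Kafka API)."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> append-only list of messages
        self.offsets = defaultdict(int)  # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not yet seen on the topic."""
        start = self.offsets[(consumer, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(consumer, topic)] = len(self.topics[topic])
        return messages


broker = ToyBroker()
broker.publish("page_views", {"user": "u1", "url": "/home"})
broker.publish("page_views", {"user": "u2", "url": "/search?q=hadoop"})

# Two independent consumers each receive every message on the topic.
analytics_batch = broker.poll("analytics", "page_views")
recommender_batch = broker.poll("recommender", "page_views")

broker.publish("page_views", {"user": "u1", "url": "/cart"})
analytics_next = broker.poll("analytics", "page_views")  # only the new message

print(len(analytics_batch), len(recommender_batch), len(analytics_next))  # 2 2 1
```

Publishers never need to know who is reading, so new consumers can be added without touching the code that produces the events.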
MongoDB: An open source,
document-oriented NoSQL database. It comes with
full index support, the flexibility to index any attribute, and the ability to scale
horizontally without affecting functionality.
Neo4j: A graph database that boasts
performance improvements of up to 1000x or more compared with
relational databases.
Oozie: A workflow processing
system that lets users define a series of jobs written in multiple languages
(such as MapReduce, Pig and Hive) and then intelligently links them to one
another. Oozie allows users to specify dependencies between jobs, so that a job
starts only after the jobs it depends on have completed.
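Job dependencies of this kind boil down to ordering a graph. The sketch below uses Python's standard graphlib module to work out a valid run order; the job names are hypothetical, and this is not how you write an Oozie workflow (Oozie workflows are defined in XML), only an illustration of the underlying idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "sessionize": {"clean_logs"},
    "daily_report": {"sessionize", "clean_logs"},
}

# static_order() yields every job only after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
# ['ingest_logs', 'clean_logs', 'sessionize', 'daily_report']
```

If the dependencies ever form a cycle, TopologicalSorter raises a CycleError instead of returning an order.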
Pig: A Hadoop-based language
developed by Yahoo. It is relatively easy to learn and is adept at very deep,
very long data pipelines.
Storm: A free, open source system for distributed
real-time computation. Storm
makes it easy to reliably process unstructured data streams
in real time. It is fault-tolerant and works with nearly all
programming languages, though typically Java is used. Originally
open-sourced by Twitter, Storm is now a top-level Apache project.
Tableau: A data visualization
tool with a primary focus on business intelligence. You can create maps, bar
charts, scatter plots and more without programming. The company recently
released a web connector that lets you connect to a database or API,
giving you the ability to show live data in a visualization.
ZooKeeper: An open source service that
provides centralized configuration and naming registration for large
distributed systems.
Every day more tools are
added to the big data technology stack, and it is extremely difficult to keep
up with each and every one. Select a few that you can master, and continue upgrading your knowledge.