Vol. 2, Issue 13, April 2016
Published by: Chitkara University

Big Data and Analytical Challenges: A Brief Overview

We live in an era where we are connected to the internet 24x7, often through several devices at a time: I may be browsing a social media site on my smartphone while making an online transaction on my PC. The amount of data generated by our day-to-day activities over the internet is what gives rise to the Big Data problem. According to a survey by IBM, 90% of the data available on the internet was generated in the last five years. The internet, which has been available to us since the late 70s, had never seen such a flood of data before.

People-to-people transactions over social media and networking sites, people-to-machine transactions over e-commerce and banking sites, and sensor networks that detect the presence of intruders over an area are a few examples of sources of data generated over the internet. The following figure shows some of the potential Big Data sources:

Sources of Big Data
(Source: https://www.mssqltips.com/sqlservertip/3132/big-data-basics--part-1--introduction-to-big-data/)

What distinguishes traditional data from Big Data:

Big Data is characterized by the 3Vs, which distinguish it from traditionally generated data. The 3Vs are Volume, Velocity and Variety.

3Vs of Big data
(Source: http://sci2s.ugr.es/BigData)

Here is an explanation of each of the 3Vs:

Volume: The volume of data being generated over the internet is tremendous. In the early days of the internet, users searched for information through web directories. These worked much like telephone directories, with entries saved in alphabetical order, but the turnaround time for each search was very large. Google changed internet search altogether by introducing crawlers and indexing: crawlers gather information about each keyword on websites and save it on Google's servers. This meant data was being captured 24x7, day in and day out, generating a huge volume of data over the internet. With the advent of social media sites, available on desktops as well as handheld devices, the amount of data increased almost 40 times. According to one estimate, we generate 500 TB (terabytes) of data on Facebook every day.
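The indexing idea described above can be sketched in a few lines of Python: instead of scanning pages one by one (as a directory-style lookup would), an inverted index maps each keyword to the set of pages that contain it, so a search becomes a single dictionary lookup. The page names and contents below are invented examples, not real crawl data.

```python
from collections import defaultdict

# Toy "crawled" pages (invented examples, not real data).
pages = {
    "page1.html": "big data volume velocity variety",
    "page2.html": "online shopping travel bag",
    "page3.html": "big data analytics challenges",
}

# Build the inverted index: keyword -> set of pages containing it.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

# A keyword search is now a direct lookup, not a scan of every page.
print(sorted(index["data"]))  # pages containing the word "data"
```

A real search engine adds ranking, stemming and distributed storage on top of this, but the core trade-off is the same: spend storage and crawling effort up front to make each query fast.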

Velocity: The speed at which data is generated is another feature of Big Data. There is no fixed time frame or interval at which data arrives; servers must track every activity, and the data thus generated needs to be saved. Imagine a real-time scenario: we visit an e-commerce website, say Amazon.com, and search for a travel bag. The next time we open the same browser, we are flooded with ads selling travel bags. How is this possible? Clickstreams track the mouse clicks made on a particular website from a particular IP address. Now imagine the number of clicks we make on a website, multiplied by the number of visitors to that website. This is the frequency at which data arrives at a server, giving rise to Big Data.
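The clickstream tracking described above can be sketched as a stream of (IP address, page, timestamp) events aggregated per visitor. The IP addresses, pages and timestamps below are invented examples of what such a log might contain.

```python
from collections import Counter

# A toy clickstream: each click arrives as (ip, page, timestamp).
# All events here are invented examples.
click_stream = [
    ("203.0.113.5", "/travel-bags", 1),
    ("203.0.113.5", "/travel-bags/item42", 2),
    ("198.51.100.7", "/home", 2),
    ("203.0.113.5", "/checkout", 5),
]

# Count clicks per visitor; an ad server could use this kind of
# per-IP activity to decide which ads to show next time.
clicks_per_ip = Counter(ip for ip, page, ts in click_stream)
print(clicks_per_ip["203.0.113.5"])  # clicks from this visitor: 3
```

At real scale this aggregation runs continuously over millions of events per second, which is exactly why velocity, not just volume, makes the problem hard.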

Variety: The data sets generated come from a variety of sources and are of various types. Google Search generates text, YouTube generates videos, and Picasa generates images. There is no single, fixed data type for the data made available on the internet and saved on web servers.
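A small sketch of the variety problem: records arriving from different sources carry different fields, so no single fixed table schema fits them all. The records below are invented examples of what a search, video and image source might emit.

```python
# Records from three different sources (invented examples).
# Each has a different set of fields - there is no common schema.
records = [
    {"source": "search", "query": "big data", "lang": "en"},
    {"source": "video", "title": "intro.mp4", "duration_sec": 612},
    {"source": "image", "file": "photo.jpg", "width": 1024, "height": 768},
]

# A schema-less (document-style) store can hold all of them, but any
# cross-source analysis must first find what the records share.
field_sets = [set(r) for r in records]
common = set.intersection(*field_sets)
print(common)  # only "source" is present in every record
```

This is why variety pushes storage toward flexible formats (JSON documents, key-value stores) rather than the rigid rows and columns of a traditional database.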

Analytical Challenges

Traditionally, data was stored in the form of matrices (combinations of rows and columns). Statisticians, who were relied upon to analyze such data using various statistical measures, now find those measures hard to apply to Big Data. Here is why. A traditional statistical problem stores data as an N x P matrix: say the marks of N students in P subjects, where we need to calculate the mean score of each subject. In this scenario N is fixed (in some cases P may be dynamic, but N never is). In Big Data, N is no longer fixed, since data keeps arriving, which renders the usual measures of central tendency, computed over a complete fixed data set, ineffective.

Data storage in databases (which store data in the form of tables) also becomes ineffective, because the amount of data generated over the internet, even for a single website, cannot be stored in one database alone. It requires parallel and distributed databases running 24x7 to capture the data. Database and warehousing queries over these servers can simply pull the raw data, but analytics is expected to extract information from that data, which simple SQL queries cannot provide given the high velocity of data flowing in and out of the servers.
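The point about N not being fixed can be illustrated with a running (streaming) mean: when we cannot wait for "all" the data to arrive, we can still update the mean incrementally as each new value comes in. The score values below are invented examples.

```python
def running_mean(values):
    """Yield the mean of everything seen so far after each new value.

    Unlike a fixed-N mean, this never needs the whole data set at
    once - it works even when values keep arriving indefinitely.
    """
    total, n = 0.0, 0
    for v in values:
        n += 1
        total += v
        yield total / n

# Scores arriving one at a time (invented examples).
scores = [70, 80, 90, 60]
means = list(running_mean(scores))
print(means)  # [70.0, 75.0, 80.0, 75.0]
```

This style of incremental computation, rather than the batch formulas of classical statistics, is the basic building block of streaming analytics over unbounded data.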


By: Prasenjit Das, Asst. Prof., CSE, Chitkara University, H.P.

About Technology Connect
The aim of this weekly newsletter is to share with students and faculty the latest developments, technologies and updates in the field of Electronics & Computer Science, and thereby promote knowledge sharing. All our readers are welcome to contribute content to Technology Connect; just drop an email to the editor. The first volume of Technology Connect featured 21 issues published between June 2015 and December 2015. This is Volume 2.
Happy Reading!

Disclaimer: The content of this newsletter is contributed by Chitkara University faculty and taken from resources that are believed to be reliable. The content is verified by the editorial team to the best of its ability, but the editorial team disclaims responsibility for validating the sources and the accuracy of the content. The objective of the newsletter is only to spread awareness of technology among faculty and students, not to impose on or influence the decisions of individuals.