Thursday, March 3, 2016

Big Data Types and Data Warehousing 

The only constant is change, and the change sweeping our generation is the move to digital. As digitization increases, so does the amount of data available. In this post I will talk about the two types of data that people are accustomed to using, structured and unstructured data, and take a closer look at the differences between them.

Structured vs. Unstructured
1)    Structured Data: In my understanding, structured data is organized. It can be displayed in rows and columns, which makes it easy to analyze, easy for machines to read and easy to search with basic algorithms. It is very important for businesses and acts as a backbone for business insights. Examples are sensor data (GPS data, manufacturing sensors, medical devices), point-of-sale data (credit card information, location of sale, product information), call detail records (time of call, caller and recipient information), web server logs (page requests, other server activity) and any data entered into a computer, such as age, zip code or gender. Structured Query Language (SQL) is the most common way to query this data, and operations like insert, delete and update can be performed on it.
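To make the insert, update and delete operations mentioned above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table and its columns are made up for illustration and are only meant to resemble point-of-sale data.

```python
import sqlite3

# In-memory database; the "sales" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, zip_code TEXT, amount REAL)"
)

# Insert: add three rows of structured data.
conn.executemany(
    "INSERT INTO sales (product, zip_code, amount) VALUES (?, ?, ?)",
    [("coffee", "13210", 3.50), ("bagel", "13210", 2.25), ("tea", "10001", 2.75)],
)

# Update: change the price recorded for one product.
conn.execute("UPDATE sales SET amount = ? WHERE product = ?", (3.75, "coffee"))

# Delete: remove one row.
conn.execute("DELETE FROM sales WHERE product = ?", ("tea",))

# Query: because the data sits in rows and columns, searching and
# aggregating it takes a single statement.
rows = conn.execute("SELECT product, zip_code, amount FROM sales ORDER BY id").fetchall()
total_by_zip = conn.execute(
    "SELECT zip_code, SUM(amount) FROM sales GROUP BY zip_code"
).fetchall()
print(rows)
print(total_by_zip)
```

The point of the sketch is that the fixed columns are what make searching and aggregation this simple.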

2)    Unstructured Data: This, on the other hand, is not organized and does not follow any structured data model. It cannot be displayed in a fixed format and hence cannot be analysed very easily. The main task with this data is to make sense of it, or in other words to give it structure, so that it can help businesses make better decisions. Since this type makes up roughly 85% of the data available, it is very important to make use of it. Its biggest source is social media. Examples include emails, text documents (Word docs, PDFs, etc.), social media posts, videos, audio files and images. Platforms such as NoSQL databases and Hadoop can be used to handle unstructured data.
[Figure: Sources of Data]
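As a small illustration of giving structure to unstructured data, the sketch below pulls a few fields out of a free-text social media post with regular expressions. The sample post and the chosen fields are assumptions for the example; real text analytics is far more involved.

```python
import re

# A made-up social media post: free text with no fixed schema.
raw_post = "Loved the new phone from @AcmeMobile! Battery lasts 2 days. #happy #tech"


def structure_post(text):
    """Turn a free-text post into a row-like record with named fields."""
    return {
        "mentions": re.findall(r"@(\w+)", text),   # accounts mentioned
        "hashtags": re.findall(r"#(\w+)", text),   # topics tagged
        "word_count": len(text.split()),           # whitespace-separated tokens
    }


record = structure_post(raw_post)
print(record)
```

Once every post is reduced to a record like this, the records can be stored in rows and columns and analysed like any other structured data.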

Types of Data:
There are many types of data available for an organization to use. In this post I will list a few:
  • Spatial Data: This is data that has several dimensions. It includes geospatial and structo-spatial data. Location matters in this data, but the location does not have to be geographical.
  • Integrated Operational Data: This consists of the operational data sets that cover a business. It is subject-oriented, integrated and time-current.
  • Redundant Data: This is duplicate data stored at multiple data sites. It has to be managed carefully to make sure that information quality is maintained.
  • Integrated Historical Data: Historical data is important to keep. It is made up of many different data types and comes from different sources, so it is very important to integrate and maintain it.
  • Foredata: This is data that is developed up front. It consists of data about objects and events, and covers all the data that people in the organization interact with.
  • Legacy Data: This comes from virtually anywhere and supports legacy systems. It includes hierarchical, XML, network and object data, and is also called disparate data.
  • Demographic Data: This deals with data about the human population. It represents identification, location, gender and other factors.
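The redundant data item in the list above can be sketched in a few lines: records for the same customer stored at two sites are merged so that each customer appears only once. The site contents and the `customer_id` key are hypothetical.

```python
# Two hypothetical data sites that both hold a record for customer 2.
site_a = [
    {"customer_id": 1, "zip_code": "13210"},
    {"customer_id": 2, "zip_code": "10001"},
]
site_b = [
    {"customer_id": 2, "zip_code": "10001"},
    {"customer_id": 3, "zip_code": "94105"},
]


def merge_without_duplicates(*sites):
    """Combine records from several sites, keeping the first copy of each customer."""
    seen = set()
    merged = []
    for site in sites:
        for record in site:
            key = record["customer_id"]
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


merged = merge_without_duplicates(site_a, site_b)
print([r["customer_id"] for r in merged])
```

This keeps whichever copy is seen first. In practice the copies may disagree, and deciding which one to trust is the harder part of maintaining information quality.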
[Figure: Growth of Data over time]

A data warehouse is a traditional way to integrate data: data is extracted from source systems, transformed and loaded into the warehouse. Although managing data from different sources is extremely difficult, a data warehouse has both benefits and limitations.


[Figure: Data Warehouse Structure]
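The extract, transform and load steps that feed a data warehouse can be sketched roughly as follows. This is only a toy version under stated assumptions: the source is a made-up CSV export, the warehouse is an in-memory SQLite database, and the `fact_sales` table name is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system, with messy values on purpose.
SOURCE_CSV = """product,amount,sold_on
Coffee ,3.50,2016-03-01
bagel,2.25,2016-03-01
TEA,,2016-03-02
"""


def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize product names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete record, skipped in this sketch
        cleaned.append(
            (row["product"].strip().lower(), float(row["amount"]), row["sold_on"])
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (product TEXT, amount REAL, sold_on TEXT)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT product, amount FROM fact_sales ORDER BY product"
).fetchall()
print(loaded)
```

Even in this tiny example the transform step needs rules for messy names and missing values, which hints at why integrating many real sources takes so much effort.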
Benefits of a Data Warehouse:
  • Better access to information, so better decisions can be made based on it
  • Tighter control of the data and better security
  • Timely access to information
  • The ability to analyze data quickly
  • High query success
Limitations of a Data Warehouse:
  • Data comes in various forms and is stored on different systems. Integrating it is difficult and time consuming, and requires intensive manual processing
  • Unstructured data cannot be stored, and there is no way to store real-time data
  • No central place is available to view the data
  • No automated way to build reports based on the data
  • Data is static and dated
  • Limited flexibility, since different types of users require separate data marts
  • Adding new data sets takes additional time and comes at a high cost
  • Security is a major issue, as data owners lose control over their data
  • High initial implementation costs

Future of Data Warehousing:
Data warehousing is a very common technique that organizations use to get insights from data. It helps them integrate data from multiple sources and allows processing of millions of rows. Along with its advantages it has several disadvantages, and in the future these should be mitigated.
In this era where everything is cloud based, the future of data warehousing should be integration with the cloud. Cloud-based warehousing will allow data analytics to be provided through private or public clouds. This will not only give organizations more storage space but also let them tailor storage to their needs. A cloud-based model will also allow organizations to access analytics from anywhere.
[Figure: Cloud based Data Warehouse]
Data warehousing should also be able to incorporate real-time unstructured data, in order to make sense of it and help organizations make better decisions. In this age where everything is going digital, it is critical for companies to capture this digital data, and hence data warehousing should support big data workloads such as social media analytics.
