Editor’s note: Markus Ehrenmueller-Jensen is the author of Microsoft Business Tools, and is well positioned to provide context on ways users can better leverage these tools. -bg
Big Data
I can remember what a sensation it was, back at the beginning of my school days, when I could upgrade from a “normal” 5¼″ floppy disk with a capacity of 360 KB to an HD disk, which offered a multiple of that space at 1.2 MB. My first personal computer, with an 80286 processor, had a hard disk with a capacity of 40 MB, which I later complemented with an additional 80 MB hard disk. Neither a modern operating system nor my job-related documents nor the digital pictures of my family would fit on such a disk today. Predictions of the annual growth rate of data are revised upward every year. Rafal Lukawiecki was right when he said at Microsoft Day 2013 at the Hofburg in Vienna that “today’s Big Data is tomorrow’s Little Data.”
Interpretation
Today the problem is not that we cannot record the data we need within the company or gather data from external sources; we are online almost 24/7 through smartphones and tablet computers. The challenge is to cope with the masses of data and to filter valuable information out of them. For example:
· Log files
· Sensor technologies and “internet of things”
· Pictures and videos
· Stock exchange
· Comments in social media
· Orders through a web shop
· Demographic statistics
These data accumulations are primarily used to “better understand customers and markets, better manage risks and make better business decisions” (BARC 2013).
On the one hand, data is flooding in on us; on the other hand, the half-life of data is shrinking rapidly. Sales figures from last year, last month, or even last week may already be stale and of no business value. This makes it important to analyze data very quickly.
Classic data warehouse concepts struggle to cope with such challenges. An ETL process can no longer finish overnight, and pre-aggregation would contradict the real-time character of the data. In the good old days, it was acceptable to start a batch job overnight; today, users are disappointed when they do not get the answer to a query within minutes or even seconds. Quick responses are crucial to intuitive analytics, and end users now expect a high level of flexibility and wickedly fast queries.
Fortunately, we can build on new but proven technologies (see below) to store data without knowing how it will be queried later on. A conventional data warehouse does not allow this: tables and columns must be prepared and optimized in advance so that joins can succeed when they are queried. When something is missing from the data warehouse’s design, we get into trouble at a later point in time.
Three V’s
“Big Data” is a buzzword now, so everyone has a different concept in mind of what the term should actually mean. The most common interpretation is that data is “Big Data” when it fulfills at least one of the following qualities (see Laney and also Barlow):
· Volume (large amounts of data)
· Velocity (rapidly growing or rapidly arriving data)
· Variety (unstructured or complexly structured data)
These three qualities are anything but new: data with such characteristics has always existed. The problem arises when volume, velocity, or variety asks too much of the available technical solutions.
Therefore, BARC (see above) defines “Big Data” not through the data itself, but as the methods and techniques used to collect, store, and analyze polystructured data in a highly scalable way. That is the point: what is new are the methods and techniques available today, not the qualities of the data.
High scalability guarantees that even huge and rapidly growing masses of data can be handled. As long as the data is structured, it can be processed in relational databases. The challenge comes in when relational data has to be joined with polystructured data, which is more or less unstructured.
Many companies have solved these problems: Microsoft Bing analyzes 100 petabytes of data to answer search requests; Twitter and Facebook deliver messages from millions of users to their followers and friends in real time; online betting services allow live bets during the Super Bowl.
In the world of relational data, we use indexes and aggregations to achieve decent query performance. However, the overhead of maintaining those additional database objects is a drawback and can, at some point, outweigh the gain in performance. Eventually, the maintenance window may become too short to complete all the necessary tasks of loading the data warehouse, rebuilding the indexes, and recalculating the data marts.
Scale-up vs. Scale-out
Adding an 80 MB hard disk to my first personal computer, or later replacing it with a new model with a better processor, is typical of (vertical) scale-up. We still use this strategy nowadays, even with our servers: if a server reaches the limits of its capacity, we substitute better hardware, such as a faster processor or more memory.
The hottest trend in scale-up comes with in-memory technologies. Here it is important to have enough memory; the response time of the hard disk matters less, because after an initial load everything stays in memory. Data warehouses benefit from vertical indexes (column stores), which allow high compression rates (so that even a big table can be held in memory) and are optimized for parallel operations. Even OLTP systems can hold tables and procedures in memory without sacrificing ACID; here, (almost) negligible access times make it possible to reduce database overhead and improve performance even further.
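Why does storing data column-wise compress so well? Here is a minimal sketch in Python (sample data and names are made up for illustration) that contrasts a row-wise layout with a column-wise one and applies simple run-length encoding, one of the techniques that makes column stores so compact:

```python
from itertools import groupby

# Hypothetical sales rows: (country, product, amount).
rows = [
    ("AT", "Bike", 100), ("AT", "Bike", 120),
    ("AT", "Car", 900), ("DE", "Car", 950),
    ("DE", "Car", 910), ("DE", "Bike", 110),
]

# Row store: values of different columns are interleaved,
# so long runs of identical values are rare.
row_layout = [value for row in rows for value in row]

# Column store: each column is stored contiguously,
# which produces long runs of repeated values.
columns = {
    "country": [r[0] for r in rows],
    "product": [r[1] for r in rows],
    "amount":  [r[2] for r in rows],
}

def run_length_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

print(run_length_encode(columns["country"]))  # [('AT', 3), ('DE', 3)]
print(run_length_encode(columns["product"]))  # [('Bike', 2), ('Car', 3), ('Bike', 1)]
```

The repeated values that are scattered across the row layout line up in the column layout and compress very well; this is the effect that lets a big table fit in memory.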
The advantage of a purely server-side approach to scaling up is that your applications do not have to be touched. The problem with this approach is that not every component can always be upgraded to a better one: if you already own the best server, there is no potential for further improvement.
This is where (horizontal) scale-out comes into the game. You improve performance not by buying better hardware, but simply by buying more of it and operating it in parallel. Although this approach uses commodity hardware at moderate prices, that does not mean the solution will be cheaper: you have to touch your applications, and administering the hardware nodes is not free either. The sketch below illustrates the basic idea.
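As a minimal sketch of scale-out (node count, keys, and data are hypothetical), the following Python snippet distributes rows across parallel nodes by hashing a key; a query then fans out, each node aggregates only its own partition, and the partial results are merged:

```python
NODE_COUNT = 4  # hypothetical commodity nodes operated in parallel

def node_for(key: str) -> int:
    """Assign a row to a node by hashing its distribution key."""
    return hash(key) % NODE_COUNT

# Distribute incoming orders across the nodes.
orders = [("cust-1", 100), ("cust-2", 250), ("cust-3", 75), ("cust-1", 30)]
partitions = {n: [] for n in range(NODE_COUNT)}
for customer, amount in orders:
    partitions[node_for(customer)].append((customer, amount))

# Fan out the query: each node computes a partial aggregate in parallel,
# and the coordinator merges the partial results.
partial_sums = [sum(amount for _, amount in partitions[n]) for n in range(NODE_COUNT)]
print("total revenue:", sum(partial_sums))  # 455
```

This is also why applications have to be touched: data and queries must be expressed in a form that can be partitioned across the nodes.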
Hadoop
Analyzing petabytes of data is no challenge for Hadoop. This open-source tool is the best known when it comes to Big Data, which is why Microsoft built Hadoop connectors for SQL Server and officially supports the open-source project.
Hadoop’s concept for Big Data is called “MapReduce.” In a first step, the input is mapped to key/value pairs, which are distributed to parallel cluster nodes. In a second step, the results are merged (reduced) by key. This technical approach is very fast and flexible. The downside is that we have to convert relational queries (SQL) into MapReduce jobs.
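As a minimal, single-process sketch of the pattern (the distribution across cluster nodes is only simulated here, and the sample documents are made up), the classic word-count example looks like this in Python:

```python
from collections import defaultdict

def map_phase(document: str):
    """Map step: emit a (key, value) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce step: merge all values emitted for one key."""
    return key, sum(values)

documents = ["Big Data", "big data is tomorrows little data"]

# Shuffle: group the emitted pairs by key, as the framework would do
# between the map and reduce phases on a real cluster.
groups = defaultdict(list)
for doc in documents:  # on a cluster, one mapper runs per input split
    for key, value in map_phase(doc):
        groups[key].append(value)

results = dict(reduce_phase(k, vs) for k, vs in groups.items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'tomorrows': 1, 'little': 1}
```

A SQL aggregation such as SELECT word, COUNT(*) … GROUP BY word has to be translated into exactly this map/shuffle/reduce shape before Hadoop can execute it.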
Appliances
Over the last couple of years, Microsoft has allied with HP and Dell to offer combined packages of matched hardware and software. In the case of the Parallel Data Warehouse, appliances can be scaled out from a quarter rack up to 64 racks. The hardware ships with a special version of SQL Server that responds to requests like any other edition of SQL Server: every application that can communicate with a “normal” SQL Server can also communicate with Parallel Data Warehouse.
However, the Parallel Data Warehouse is very different on the inside. It comes with an implementation of Hadoop called “PolyBase,” which internally works with MapReduce rather than SQL; with the xVelocity technology (to store data column-wise); and with project Hekaton (in-memory technology for OLTP workloads). Nevertheless, only administrators would recognize that this implementation of SQL Server differs from others (because of the new and shiny knobs they can fiddle with). No developer and no end user has to be retrained to use Parallel Data Warehouse; for them it is just another SQL Server like the ones they already know. This is unique to Microsoft’s approach compared with its competitors, who offer dedicated hardware and software solutions that are not seamlessly integrated into the relational database.
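To make the “just another SQL Server” point concrete, here is a sketch of what a client application might look like (the client library, server name, and table are assumptions for illustration, not taken from the article): the application sends ordinary SQL, even though the table it queries could hold data that PolyBase reads from Hadoop behind the scenes.

```python
import pyodbc  # any ODBC-capable client will do; the appliance speaks plain SQL

# Hypothetical connection string and table name, for illustration only.
connection = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=pdw-appliance;DATABASE=Sales;Trusted_Connection=yes"
)
cursor = connection.cursor()

# dbo.WebLogs could be a table whose data actually lives in Hadoop;
# thanks to PolyBase, the query looks exactly like any other SQL query.
cursor.execute("""
    SELECT TOP 10 Url, COUNT(*) AS Hits
    FROM dbo.WebLogs
    GROUP BY Url
    ORDER BY Hits DESC
""")
for url, hits in cursor.fetchall():
    print(url, hits)
```

Neither the developer nor the end user has to know where the data is physically stored; that is the appliance’s job.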
Self-Service BI
“Big Data” has now replaced “Self-Service BI” as the hottest buzzword. Nevertheless, using Big Data must not destroy the convenience our end users are accustomed to: they still want to analyze Big Data in the same nice and crisp way they analyze “normal” data. As mentioned above, Microsoft integrated Big Data seamlessly into SQL Server, and therefore all of the tools end users know work as usual with Big Data:
· Excel (with and without a PowerPivot Data Model)
· Reporting Services
· PerformancePoint Services
· Power View
Each of these tools has its characteristic strengths and weaknesses. If you want to know more about them, check out my book “Microsoft Business Intelligence End-user Tools 360°”.
Conclusion
It is not the problems known as “Big Data” (volume, velocity, and variety) that are new, but the technologies we now have to solve them. Microsoft’s approach works through three layers:
· Administration, which supports all kinds of data (structured, semi-structured, and unstructured data from relational, non-relational, analytic, and streaming data sources), gathered within SQL Server as the reliable product.
· Enrichment of data by joining external data sources with internal analyses, relying on relational, multidimensional, tabular, and PolyBase technologies.
· Business analytics through tools that end users are familiar with and already use (Office and SharePoint).
This clever architecture enables developers and end users to work with “Big Data” as if it were “normal” data.
Sources
BARC (2013): Big Data Survey. February 2013.
Barlow, M. (2013): Real-Time Big Data Analytics: Emerging Architecture. O’Reilly.
Laney, D. (2001): 3D Data Management: Controlling Data Volume, Velocity and Variety. META Group.