
This isn't big data at all. This is the stuff you'd use AFTER you've processed your big data and reduced it to a small dataset, on the order of a few hundred MB. To process the big data in the first place, on the order of a few PB, you'd typically use HDFS and, at the programmatic level, an API like Cascading or, better yet, Scalding. You end up needing only the last item, D3.js, with everything else existing as Scalding source.


This is where "Big Data" becomes about as useful as a penis-measuring contest. The vast majority of business data analysis happens at MS Excel levels of data. I have done analysis on several GB of data using R, and those datasets are easily in the 99th percentile of data size in my field. For datasets bigger than that, SQL is still pretty damn good. And when you get to the point where SQL starts to break down in usefulness, you are at a level where you can start sampling and still get perfectly usable results in almost all use cases.
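To make the sampling point concrete, here is a minimal sketch of one way to do it: reservoir sampling (Algorithm R), which draws a uniform random sample of k rows in a single pass without knowing the total row count up front. The function name and parameters are made up for illustration, and this is just one sampling technique among several, not necessarily what anyone in this thread uses.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Memory use is O(k) regardless of how many rows the stream yields.
    `reservoir_sample` is a hypothetical helper name, not a library function.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i (0-based) replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

You could feed this any iterator, such as a database cursor or an open file, and then run the usual analysis on the sample instead of the full table.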

In my experience, "Big Data" is a misnomer. What it really means is fast data. You have a repeatable calculation to perform on datasets that are perfectly fine chugging away on an Oracle cluster for 3 hours, but you want it done in 20 minutes. That is what Big Data is really used for. Anything else is just marketing hype.


Some of the tools in the article are enabling businesses to capture new types of data by making data collection and analysis cheaper and easier. User interactions in an app are one example: they can happen hundreds of times per session, across thousands of devices.

Most businesses don't have to deal with those data sets, because they haven't had the opportunity to record that level of detail yet.

I think one of the most exciting things about "Big Data" (ugh, so buzzy) is that there are so many new opportunities for data gathering and analysis. Now that data storage is so much cheaper, we can record more stuff.

Full disclosure, I work at one of the companies in the list (Keen IO). We didn't know the article was coming out; not sure how they came up with the list or where they got the content (the description of what we do is a bit outdated).


No. Big data has a formal definition, but one that most people ignore:

When the size of the data becomes part of the problem.

(An O'Reilly author said this once, I forget his name.)

For example, physicists in the 1980s who had tens of MB of data had a big data problem.

Nowadays, this typically means out-of-core data sets.


That definition is but one of many supposed formal definitions. And it isn't a particularly good one, because then "Big Data" becomes a term that can only be quantified person by person. According to the guy with the Marketing degree from Ho-Hum College, anything too big to fit in an Excel spreadsheet is "Big Data". Switch to Doug Cutting, and "Big Data" is in the hundreds of petabytes.

I don't buy the out-of-core definition either. I work with a data warehouse holding petabytes of information, and it certainly doesn't fit in memory. I don't use Hive, Pig, Cascading, etc. (okay, sometimes I use Cascalog, but not as a strategy for dealing with large amounts of data). I use SQL. And it works perfectly fine. But if you ask any of the people out there talking about "Big Data", a SQL database doesn't fit the definition. Hell, I have processed a 200GB CSV file using nothing more than GAWK. Nobody is calling GAWK a big data tool.
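For anyone wondering why a file that size doesn't need special tooling: a one-pass streaming aggregation only ever holds one row plus the running totals in memory. Here is a rough sketch of the idea in Python rather than GAWK (a swapped-in language, not the original script). The function name and the column names in the usage note are hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def sum_by_key(lines: Iterable[str], key_col: str, value_col: str) -> Dict[str, float]:
    """Sum value_col grouped by key_col, reading the CSV one row at a time.

    Memory grows with the number of distinct keys, not with the file size.
    Assumes the first line is a header row and value_col holds numbers.
    """
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)
```

Usage would look something like `with open(path, newline="") as f: sum_by_key(f, "user", "bytes")`, where `path`, `user` and `bytes` stand in for whatever your file actually contains.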

Face it. "Big Data" is a buzzword for CIOs that read magazines for CIOs but still need to find an engineer to set up their email on their iPad.


Precisely. The term "big data" is about selling stuff to CIOs.

Nobody at Yahoo went "we need Hadoop to deal with our Big Data problem". It was simply a problem of a very large amount of data and a relatively limited budget, and plenty of very large companies are happy using Teradata or Netezza to manage PBs of information.

The new tools are often brilliant, but almost none of the problems they solve are new.



