What Is Big Data: The Basics

By Mark Wills

The first thought that comes to mind about "Big Data" is that there is a lot of it. And while that is true, "Big Data" is more than that. It tries to address the complexity of being able to bring together both structured and unstructured data from an increasing variety of sources so it can be analysed in a concise and coherent way with a high degree of confidence.

Not so long ago, Big Data was considered to be strictly in the realm of "the very big corporations", but that is starting to change. Part of the reason is that the technology behind all the hype and the latest buzzwords is becoming more available to the not-so-big. There are now viable solutions that won't break the bank.

Because the "Big Data industry" is still in its infancy (relatively speaking), there is a small luxury of time before it becomes the expected "norm". That time can be well used to make sure you can learn all there is to know and decide if you will actually benefit from it.

Marketing departments the world over are anticipating great things, talking up "big data" as if they have a clear idea about what it is, how to get it, and how it can be used. More importantly, you need to know what it might mean for your business as a competitive advantage, or how quickly you will become disadvantaged if you are not already thinking about it.

Taking a small step back in time: for years, when we needed to analyse our ERP systems, along with maybe a smattering of other structured databases (like CRM) within an enterprise, we used some kind of Business Intelligence (BI) system.

Arguably, a lot of BI systems fell a little short of the deliverables because of a predisposition toward post analysis. They were only able to report or predict based on what was actually captured in fairly traditional sources like invoicing and accounts receivable. But we all know there is a lot more information out there -- information that adds a new dimension to the traditional data and a more realistic perspective of the enterprise.

Think about the systems you have in use and all the different methods available to interact with your company (Points of Interaction, or POI). For a start, you have the obvious on-premises solutions, but think more broadly. Maybe you have a website that is accessed to create orders or log issues. Suppose a range of different devices is used to access it -- desktops, mobile devices or maybe even telemetry. More recently, your Marketing team launched the corporate social sites, and let us not forget the slightly more traditional forms such as EDI and telephony.

Looking at the variety of different POI, we realise that those points are being monitored or logged somehow. Those new sources of information are stored in computer logs, Facebook or Twitter feeds from the corporate social sites, geographic information, cookies, activity logs, and clickstream data. Combine them with the traditional sources, and you have grown out of BI and entered the world of "Big Data".

When you start to really think about your business and everyone it touches, imagine the coincidental information available. That is the information from the various logs and devices themselves, and it is not restricted to what people have been entering on your site. It includes their IP addresses and date-time activity, their clickstream data, their mobile geolocation services, and tracking information via on-board telemetry from vehicles. Most importantly, that coincidental and associated data is being collected by machines logging the activity (at a significant frequency) whilst fulfilling other tasks. And, as we automate more functions, there is an ever-increasing diversity in what can be captured.

Thinking about the coincidental data, it becomes quite significant when associated with, for example, geographic locations. The "coincidental" part transforms into strategic data revealing geographic market strengths and opportunities. Combine that with various sources of feedback, and you expose vulnerabilities and consumer sentiment.
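
As a rough illustration of how coincidental data turns into something strategic, the sketch below counts machine-logged events per region and works out the share that are complaints. The event records, the field names (region, kind) and the helper functions are all hypothetical assumptions made up for this example; real feeds would need far more cleaning first.

```python
from collections import Counter, defaultdict

# Hypothetical machine-logged events; the field names and values are
# illustrative assumptions, not data from the article.
events = [
    {"region": "North", "kind": "order"},
    {"region": "North", "kind": "order"},
    {"region": "North", "kind": "complaint"},
    {"region": "South", "kind": "order"},
    {"region": "South", "kind": "complaint"},
    {"region": "South", "kind": "complaint"},
]


def summarise_by_region(events):
    """Count each kind of event per region."""
    summary = defaultdict(Counter)
    for event in events:
        summary[event["region"]][event["kind"]] += 1
    return {region: dict(counts) for region, counts in summary.items()}


def complaint_rate(counts):
    """Share of a region's events that are complaints."""
    total = sum(counts.values())
    return counts.get("complaint", 0) / total if total else 0.0


summary = summarise_by_region(events)
for region, counts in sorted(summary.items()):
    print(region, counts, round(complaint_rate(counts), 2))
```

Even a simple roll-up like this hints at where the strengths and vulnerabilities might be; which regions and event kinds matter is a business decision, not something the code can tell you.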

One particular scenario I worked on involved a large corporation that was dealing with excessive warranty work and wanted to gain insights into the "real" consumer of its product. The company dealt with resellers and agents (dealers), so it was always difficult to find out what happened next in the retail space. We were able to gather data about the real consumer via the myriad social forums and dealer logs. Gaining insights into the consumer space revealed a few significant issues that were relatively easy to solve. Without a "Big Data" attitude, the information flow was trapped in the reseller domain.

That's part of the problem with "Big Data". It is a big buzzword, and it is full of big ideas, and needs new attitudes toward managing and processing data.

As we have said before, it is not just size, it is the variety of potential sources that really generates the volume. So we now find a need to make sense of the variety of data. That can mean formalising data relationships, or extracting elements from unstructured data, or undergoing various transformations so it can be used.
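
To make "extracting elements from unstructured data" a little more concrete, here is a minimal sketch that pulls structured fields out of a raw web server log line. The log layout (similar to the Common Log Format), the sample line and the function name parse_log_line are assumptions for illustration only; your own logs will almost certainly look different.

```python
import re
from datetime import datetime

# Assumed layout, similar to the Common Log Format:
# ip ident user [timestamp] "METHOD path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields into usable types.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line (the IP is from a documentation-only range).
sample = (
    '203.0.113.7 - - [10/Oct/2013:13:55:36 +0000] '
    '"GET /orders/new HTTP/1.1" 200 2326'
)
record = parse_log_line(sample)
print(record["ip"], record["path"], record["status"])
```

Lines that do not fit the pattern come back as None rather than raising an error, so they can be counted and reviewed instead of silently polluting the analysis.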

Take, for example, our large corporation having warranty issues with resellers. The company's customer is the reseller, and the reseller has its own customers, which we know as consumers. However, those individuals are also known to our company by a registered user name via the website, and by something different again on the corporate social media sites. So, how do we get all those different identifiers to mean one and the same thing? The business must define rules for the different and sometimes disparate data sources.

That is the first real challenge: creating a business dictionary that defines the data correctly, consistently and uniformly. The business also needs to understand what data elements are available and what that can translate to in terms of achieving business goals.
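
One very simple way to picture such a business dictionary is as a lookup from (source, identifier) pairs to a single canonical customer ID. The sketch below assumes the mapping has already been agreed by the business; the source names, identifiers, customer IDs and function names are all hypothetical, and real identity matching is considerably messier than an exact lookup.

```python
from collections import defaultdict

# Hypothetical business dictionary: (source, identifier) -> canonical ID.
# Every entry here is invented for illustration.
IDENTITY_MAP = {
    ("dealer_log", "CUST-00412"): "consumer-001",
    ("website", "jsmith77"): "consumer-001",
    ("social", "@john_smith"): "consumer-001",
    ("website", "mary.k"): "consumer-002",
}


def resolve(source, identifier):
    """Look up the canonical customer ID, or None if no rule is defined."""
    return IDENTITY_MAP.get((source, identifier.strip()))


def group_by_customer(events):
    """Collect event text under one canonical customer per person."""
    grouped = defaultdict(list)
    for event in events:
        customer = resolve(event["source"], event["id"])
        key = customer if customer is not None else "unresolved"
        grouped[key].append(event["text"])
    return dict(grouped)


events = [
    {"source": "dealer_log", "id": "CUST-00412", "text": "warranty claim: faulty seal"},
    {"source": "social", "id": "@john_smith", "text": "second repair this year"},
    {"source": "website", "id": "mary.k", "text": "registered new product"},
    {"source": "social", "id": "@unknown_user", "text": "great service"},
]

profiles = group_by_customer(events)
for customer, texts in sorted(profiles.items()):
    print(customer, texts)
```

Keeping unmatched identifiers in an "unresolved" bucket, rather than discarding them, shows the business where the dictionary still has gaps.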

With all the different data sets coming together, we need disk space. Potentially lots of it. It has to store the individual data elements and allow for any new data feeds. There are technology solutions, and arguably a contributor to the rise in "big data" popularity could be all the discussion around cloud based solutions. An enterprise doesn't have to build its own ginormous data centre -- but that might be a more viable alternative, depending on the business.

Then there is analysing the data. Getting results that are reliable, trustworthy, usable and repeatable takes a new kind of thinking (as a technologist) and very clear goals set by the business. I say usable, because one of the possible risks with the variety of data sources available is a perceived or real possibility of contradicting privacy clauses, proprietary rights, copyright and ownership of data, and how all of those impact the marketing and selling of data.

Fortunately, there is a lot of information about "Big Data" being written out there, and a quick search can yield a heck of a lot of information (did I mention that Bing reckons they analyze over 100 petabytes of data to deliver their "high quality" search results?).

One thing you will find is a reference to the three "Vs":

Volume. Many factors contribute to the increase in data volume.

Velocity. Data is streaming in at unprecedented speed.

Variety. Data today comes in all types of formats.

The term was first penned by Doug Laney in 2001 (in Volume-Velocity-and-Variety.pdf), before "Big Data" became the current "hype", and it is now rather poetically used to describe "Big Data".

The other term you will find is Hadoop, which is basically a software platform that manages data across a large number of machines and is worthy of a separate article in and of itself.

The other thing you will find is that all the major suppliers offer their own summaries and recommended reading, and explain how they support Hadoop and/or other technologies. So the first stop for a lot of information would be your preferred hardware or database supplier. Here are a few links to get you started.

IBM : Bigdata-Enterprise

SAS : Big-Data

ORACLE : Big-Data

MCKINSEY : big_data_the_next_frontier_for_innovation

MICROSOFT : business-intelligence big-data

Now a cautionary tale... Big Data is not just gathering everything you can. That only becomes (quite simply) lots of data. People make the mistake of believing they must have Information because of all the data they are gathering. But a lot of the time, it is nothing more than consuming disk space for the sake of data collection and has no strategic business value in terms of Information (insight). Have a read about the NSA's dilemma. (You might sleep better too.)

So, beware the hype, and take the time to understand before the boss walks in with the next "Big Idea". I hope this brief introduction inspires you to seek out more information about "Big Data".

Translated from: https://www.experts-exchange.com/articles/12816/What-is-Big-Data.html
