Big Data Volume Estimation
(Note: All opinions are my own)
Introduction
Data collection is the first and most fundamental step in any Data Science or Analytics project: every later activity, from data analysis to model deployment, relies on it.
With APIs and Cloud Computing now pervasive, I am ever more interested in maximizing the efficiency and level of automation of data collection for both work and personal projects.
In the latter category, I have been interested in collecting data from online home-rental platforms in the UK market (Zoopla, RightMove, OnTheMarket, and similar), with the aim of extracting image and text data for use in machine learning models. Use cases include predicting a property's price, extracting key features from image data to infer a listing's true value, and processing customer reviews with NLP techniques.
In the following lines, I aim to discuss how to go about:
- Identifying the most critical data sources
- Estimating data collection costs, should you want to put your solution to commercial use
I gave the article a broader cut: it touches on the market and regulatory considerations involved in collecting data for potentially commercial purposes, as well as the more technical considerations of working with APIs, since there are multiple layers to surface within this very interesting topic.
I hope the key points below prove useful in setting up the Data Collection block of your current and future Data Science projects, no matter your industry focus.
Do your market research & identify your key data sources
In two-sided markets such as online home-rental platforms, there are agents on both the supply side (homeowners looking to let, either directly or through a real-estate agent) and the demand side (individuals looking to rent). You will find the most data, in both quantity and quality, on the platforms that drive the majority of supply-side and demand-side traffic in a given market.
In this sense, you need to identify the platforms that hold the most market power, as they attract the most eyeballs. Knowing how overall traffic and data volume are distributed across the market is very useful if you want to pull large amounts of data over time without integrating multiple data streams from smaller players.
In the UK's online home-rental market, the majority of traffic and listings is concentrated among the top one to five players. Those companies (the left-hand side of an illustrative market-share distribution curve) are therefore the ones on which to focus your data collection efforts.
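To make the "left of the curve" idea concrete, here is a minimal sketch that ranks platforms by their share of traffic and returns the smallest set of top players needed to reach a target coverage. The function name `platforms_needed`, the 80% target, and the share figures are all illustrative assumptions; they are not real market data for any of the platforms mentioned.

```python
def platforms_needed(shares, target=0.8):
    """Return the largest platforms that together reach `target` of total traffic.

    `shares` maps a platform name to any non-negative traffic measure
    (visits, listings, percentage points); only relative sizes matter.
    """
    total = sum(shares.values())
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    selected, covered = [], 0
    for name, share in ranked:
        selected.append(name)
        covered += share
        if covered / total >= target:
            break
    return selected, covered / total


# Made-up shares, for illustration only (not real market figures).
illustrative_shares = {
    "Platform A": 45,
    "Platform B": 30,
    "Platform C": 12,
    "Platform D": 8,
    "Platform E": 5,
}

selected, coverage = platforms_needed(illustrative_shares, target=0.8)
print(selected, round(coverage, 2))
```

With a skewed distribution like this one, a handful of sources covers most of the market, which is the argument for concentrating on the largest players rather than stitching together many small feeds.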

This is of course a double-edged sword: the big players you will be sourcing from have high leverage when entering data-sharing agreements, which allows them to:
1) act as de facto gatekeepers to a particular market and set their own data usage policies, especially in a less regulated market
2) charge more for the same unit of data volume when entering data-sharing agreements
3) effectively monitor potential competitive threats to their core business from startups that require access to their data and are thus more dependent on their services
At the same time, given a skewed distribution of market share and in the absence of enforced regulation against anti-competitive behaviour, this is where the true value of the data resides. Aspiring Data Science teams that want to get their hands on this data therefore need to pay a price to cover the majority of the market and access high-volume, high-quality data points.
N.B. For non-commercial or research purposes, you are probably OK just scraping data off these websites, although the activity is not always appreciated when done at high frequency and volume. This is purely a practical consideration: I do not encourage web scraping on websites which have policies against it, and you are always better off respecting the terms and conditions of the data provider.
Always look for APIs first
Once you have identified the main data sources, your first move is to look through their developer resources and figure out:
- Whether they have an active API from which you can pull the data you need
- What their overall data-sharing terms and conditions (T&Cs) are

Zoopla, for example, has an API page, which can be useful for returning a few features and listings data. Zoopla's API has not been updated in a while and has apparently drawn criticism, previously documented on Medium, but this is the type of information you want to look for when comparing different data sources.
Moving on to RightMove, their official website directs you to their Data Services page. They do not seem to have or authorize any official API at the time of writing. OnTheMarket.com does not seem to have an API either.
Checking the main players is incredibly useful for determining the next steps in your data collection strategy. If you find an active API, you can get some sample data and decide:
- Whether the data volume and quality are enough for your application
- Whether you are in violation of their T&Cs
- Whether you want to get in touch with the data providers (see next steps) to submit a formal data request and obtain further, hopefully richer, datasets
- Whether to move on to smaller players in the market that may give you enough data (via their own API) to start off with, such as other aggregators like Nestoria, which does provide one
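One rough way to judge whether sample data is rich enough is to measure how often each field you care about is actually populated. The sketch below assumes the sample has already been pulled and parsed into a list of dictionaries; the field names (`price`, `description`, `image_urls`) and the records themselves are hypothetical and do not reflect any provider's real response format.

```python
def field_completeness(records, required_fields):
    """Return, per field, the fraction of records with a non-empty value."""
    if not records:
        return {field: 0.0 for field in required_fields}
    counts = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            # Treat missing keys, None, empty strings and empty lists as "empty".
            if record.get(field) not in (None, "", []):
                counts[field] += 1
    return {field: counts[field] / len(records) for field in required_fields}


# Hypothetical sample listings, invented for illustration.
sample = [
    {"price": 1200, "description": "Two-bed flat", "image_urls": ["a.jpg"]},
    {"price": 950, "description": "", "image_urls": []},
    {"price": None, "description": "Studio", "image_urls": []},
    {"price": 1500, "description": "House", "image_urls": ["d.jpg"]},
]

report = field_completeness(sample, ["price", "description", "image_urls"])
for field, fraction in report.items():
    print(f"{field}: {fraction:.0%} populated")
```

A field that is populated in only half of the sample, such as the image links here, is a signal that the source may not support an image-based model without a richer dataset from the provider.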
No matter the case, do not skip this step: it provides very valuable information, even if you are not immediately given access to what you need.
Don't be afraid to get in touch with data providers and discuss potential data-sharing agreements
In my case, I decided to dig a bit deeper and tentatively got in touch with RightMove and Zoopla, via email and LinkedIn, by searching for people in Analytics roles and reaching out to viable prospects.
I recommend doing this, as you can often find people on the other side who are interested in supporting developers and hearing out interesting use cases. You may also uncover information that you had not noticed while reading through the various documentation.
In my case, I found RightMove to be very restrictive about the use of their data, and the only thing I really obtained from them was a cold shoulder. The same went for Zoopla, which merely referred me back to their existing API, whose data richness I doubted after testing it briefly with a Python script.
At this point, I decided to search online for applications and platforms that already used data from either of the two main providers, and to see if I could find out how they had done so and, potentially, at what cost.
I could also have doubled down on Zoopla and RightMove and proposed a data-sharing agreement, but as a single individual, how much leverage would I realistically have in such a conversation?
In similar cases, where you are trying to decide where and how to collect your data, I suggest you either:
- Take your time researching the market and the various data providers, and give yourself as many potential data sources as possible, which will also allow you to compare their costs against the budget you are willing to allocate to your project
- Take your time to establish a relationship with the few providers of choice (if they do not have a clear-cut API, as in this case) and extract as much pricing and other information from them as you can, while being very transparent about the use you plan to make of their data (research, commercial, personal, etc.)
Leverage the expertise of others who have collected the data before you
After identifying your main data sources and checking for APIs and their usage potential, you will also want to reach out to other market players who are using those same data sources and see if you can uncover further insights.
I found this to be an incredibly useful little step in getting great-quality contextual information about data collection costs.
For example, I found a great website, PropertyData, which cites the same data sources I was looking at, so I immediately sent an email using their contact form.
To my surprise, the founder himself replied. He mentioned how much one provider was charging PropertyData for the data they needed, and confirmed that they had not been able to convince another provider to share its data at any price point, which matched my earlier negative experience of reaching out to most providers via email and LinkedIn.
(Below is an extract from the email response I got from PropertyData, sanitised where possible for confidentiality reasons.)
“We pay Source 1 £XX per month. That did the trick to get us what we needed!
Source 2, no amount of money makes them interested!
PropertyData”
This is great information as:
- It gives you an actual figure from which to extrapolate data collection costs for similar providers, in the absence of any API or published price points.
- It gives you a further indication of which data sources might be more feasible to work with and which you might avoid altogether, using the experience of others as a compass.
I always recommend taking the time to reach out to those who have done it before and just ask. You might get pleasantly surprising and helpful responses in return!
Run your estimations and check financial and technical feasibility
By this point, you should have collected all the information needed to calculate the monthly running costs of data collection, which can be estimated as:
(Number of data sources * Avg. monthly subscription cost of API/data agreement)
To this, you might want to add any Cloud Computing resources, which will depend on your data collection scripts and on the amount of processing (driven by time and data size) needed to get your data into your data lake or data warehouse for later processing and analysis.
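The estimate above can be sketched in a few lines of Python. The function name `monthly_collection_cost`, the source names, and every figure below are placeholders I have assumed for illustration; they are not the prices quoted to me or to anyone else. The cloud figure is a single lump sum that you would need to estimate separately from your own scripts and data sizes.

```python
def monthly_collection_cost(subscriptions, cloud_cost=0.0):
    """Estimate monthly data collection cost.

    `subscriptions` maps each data source to its monthly API or
    data-agreement fee. The result follows the formula in the text:
    number of sources * average monthly fee, plus cloud resources.
    """
    n_sources = len(subscriptions)
    if n_sources == 0:
        return cloud_cost
    avg_fee = sum(subscriptions.values()) / n_sources
    return n_sources * avg_fee + cloud_cost


# Placeholder fees in GBP per month, not real quotes.
subscriptions = {"Source 1": 100.0, "Source 2": 250.0}

estimate = monthly_collection_cost(subscriptions, cloud_cost=40.0)
print(f"Estimated monthly cost: £{estimate:.2f}")
```

Multiplying the number of sources by the average fee is the same as summing the individual fees; the averaged form is most useful when you only know one or two real price points and have to extrapolate the rest.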
Aside from the mere numbers, at this point you should also develop a sense of the overall technical feasibility of the approach given your project setup, and of whether it makes sense to proceed or to completely pivot your data collection strategy.
In summary
A sound data collection methodology can really get your data science project up and running in the best way, securing the best possible data at the best possible price given your market domain knowledge and the data providers available.
If you can:
- Conduct solid market research and identify the best quality sources
- Thoroughly check for existing APIs and their (usually) rich documentation
- Additionally reach out to data providers to discuss potential data requests and gauge their willingness to assist you
- Further increase your knowledge base by asking people and companies who were given access to the data before you
- Get a fair estimate of how much time and money you are realistically going to spend to capture all the data you need
then you can greatly increase your chances of developing a sound approach to data collection and of getting great data in an efficient way. Thanks for reading!