How Web Scraping Brings Freedom to Research

2022-08-04 08:12:43

There are several stages to any academic research project, most of which differ depending on the hypothesis and methodology. Few disciplines, however, can completely avoid the data collection step. Even in qualitative research, some data has to be collected.

Unfortunately, the one unavoidable step is also the most complicated one. High-quality research necessitates a ton of carefully selected (and often randomized) data, and gathering all of it takes an enormous amount of time. In fact, it's likely the most time-consuming step of the entire research project, regardless of discipline.

Four primary methods are employed when data has to be collected for research. Each of them comes with numerous drawbacks, and some are especially troublesome:

Related: Website Scraping Is an Easy Growth Hack You Should Try

Manual data collection

One of the most tried-and-true methods is manual collection. It's almost foolproof, as the researcher retains complete control over the process. Unfortunately, it's also the slowest and most time-consuming practice of them all.

Additionally, manual data collection runs into issues with randomization (where it's required), as it can be nigh impossible to ensure fairness in the dataset without expending even more effort than initially planned.

Finally, manually collected data still requires cleaning and maintenance. There's too much room for error, especially when extremely large swaths of information need to be collected. In many cases, the collection isn't even performed by a single person, so everything needs to be normalized and standardized.
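
To give a sense of what that normalization involves, here is a minimal sketch, assuming hypothetical field names and two collectors with different recording habits:

```python
# Minimal sketch: harmonizing records gathered by different collectors.
# Field names, values and date formats are hypothetical, for illustration only.
from datetime import datetime

raw_records = [
    {"name": "ALICE SMITH", "visited": "2022/08/04"},  # collector A's style
    {"name": "bob jones", "visited": "04-08-2022"},    # collector B's style
]

def normalize(record):
    """Bring names and dates into one canonical form."""
    name = record["name"].strip().title()
    for fmt in ("%Y/%m/%d", "%d-%m-%Y"):  # known collector formats
        try:
            date = datetime.strptime(record["visited"], fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"Unrecognized date format: {record['visited']!r}")
    return {"name": name, "visited": date}

cleaned = [normalize(r) for r in raw_records]
print(cleaned)  # both records now share one name style and ISO dates
```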

Existing public or research databases

Some universities purchase large datasets for research purposes and make them available to the student body and other employees. Additionally, due to existing data laws in some countries, governments publish censuses and other information yearly for public consumption.

While these are generally great, there are a few drawbacks. For one, university purchases of databases are driven by research intent and grant funding. A single researcher is unlikely to convince the finance department to buy the data they need from a vendor, as there might not be sufficient ROI to justify it.

Additionally, if everyone is acquiring their data from a single source, that can cause uniqueness and novelty issues. There's a theoretical limit to the insights that can be extracted from a single database, unless it's continually renewed and new sources are added. Even then, many researchers working with a single source might unintentionally skew results.

Finally, having no control over the collection process might also skew the results, especially if data is acquired through third-party vendors. The data might have been collected without research purposes in mind, so it could be biased or reflect only a small piece of the puzzle.

Related: Using Alternative Data for Short-Term Forecasts

Getting data from companies

Businesses have begun working more closely with universities. Many companies, including Oxylabs, have developed partnerships with numerous universities. Some businesses offer grants. Others provide tools or even entire datasets.

All of these types of partnerships are great. However, I firmly believe that providing only the tools and solutions for data acquisition is the correct decision, with grants being a close second. Datasets are unlikely to be that useful for universities for several reasons.

First, unless the company extracts data for that particular research alone, there may be issues with applicability. Businesses will collect data that's necessary for their operations and not much else. Such data may happen to be useful to other parties, but that won't always be the case.

Additionally, just as with existing databases, these collections might be biased or have other issues with fairness. Such issues might not be as apparent in business decision-making, but they could be critical in academic research.

Finally, not all businesses will give away data with no strings attached. While some precautions may be warranted, especially if the data is sensitive, some organizations will want to see the results of the study.

Even without any ill intentions from the organization, outcome reporting bias could become an issue. Non-results or bad results could be seen as disappointing and even damaging to the partnership, which would unintentionally skew research.

Moving on to grants, there are some known issues with them as well, though they are not as pressing. As long as a study is not funded entirely by a company involved in the field it examines, publishing biases are less likely to occur.

In the end, providing the infrastructure that allows researchers to gather data without any overhead beyond the necessary precautions is the approach least susceptible to bias and other publishing issues.

Related: Once Only for Huge Companies, 'Web Scraping' Is Now an Online Arms Race No Internet Marketer Can Avoid

Enter web scraping

Continuing from my previous point, one of the best solutions a business can provide researchers with is web scraping. After all, it's a process that enables automated data collection (in either raw or parsed formats) from many disparate sources.
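
To make the distinction between raw and parsed formats concrete, here is a minimal sketch using the popular requests and BeautifulSoup libraries; the URL and the h2 selector are placeholders, not a real research source:

```python
# Minimal web scraping sketch: fetch a page (raw) and extract data (parsed).
# The URL and the CSS selector are placeholders, for illustration only.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"
response = requests.get(url, timeout=10)
response.raise_for_status()

raw_html = response.text  # "raw" format: the page exactly as served

soup = BeautifulSoup(raw_html, "html.parser")  # "parsed" format: structured
titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
print(titles)
```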

Creating web scraping solutions, however, takes an enormous amount of time, even if the necessary knowledge is already in place. So, while the benefits for research might be great, there's rarely a good reason for someone in academia to get involved in such an undertaking.

Such an undertaking is time-consuming and difficult even before we account for all the other pieces of the puzzle: proxy acquisition, CAPTCHA solving and many other roadblocks. As such, companies can provide access to ready-made solutions, allowing researchers to skip past these difficulties.
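
As one taste of those roadblocks, the sketch below routes a request through a proxy with the requests library; the proxy endpoint and credentials are placeholders, and a real setup would also need rotation, retries and CAPTCHA handling:

```python
# Sketch: sending a request through a proxy, one of the roadblocks above.
# The proxy endpoint and credentials are placeholders, for illustration only.
import requests

proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

response = requests.get("https://example.com", proxies=proxies, timeout=10)
print(response.status_code)  # 200 if the proxy relayed the request successfully
```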

Building web scrapers, however, would not be essential if these solutions didn't play an important part in the freedom of research. In all the other cases I've outlined above (outside of manual collection), there's always a risk of bias and publication issues. Additionally, researchers are then always limited by one factor or another, such as the volume or selection of data.

With web scraping, however, none of these issues occur. Researchers are free to acquire any data they need and tailor it to the study they are conducting. The organizations providing web scraping also have no skin in the game, so there's no reason for bias to appear.

Finally, as so many sources are available, the doors are wide open to conduct interesting and unique research that otherwise would be impossible. It's almost like having an infinitely large dataset that can be updated with nearly any information at any time.

In the end, web scraping is what will allow academia and researchers to enter a new age of data acquisition. It will not only ease the most expensive and complicated part of research, but also let them break away from the conventional issues that come with acquiring data from third parties.

For those in academia who want to enter the future earlier than others, Oxylabs is willing to join hands in helping researchers through the pro bono provision of our web scraping solutions.
