Since I have a background and experience in cloud operations, I started my
research by looking at the tools available from the major cloud service
providers who compete with IBM. I reviewed Big Data service offerings and
tools from Alibaba, Amazon, Google, and Microsoft.
Alibaba has a solution called Hologres, a real-time data warehouse
compatible with PostgreSQL. They also offer an analytics product called
MaxCompute, enabling one to query and analyze data with low latency and
high concurrency. One can utilize MaxCompute to analyze logs, e-commerce
transactions, and user profile and behavior data, and to support
other data analytics projects (What is MaxCompute? 2022). Alibaba also
has a product named Quick BI to perform data analytics, exploration, and
reporting.
Amazon offers several different data analytics services that support
computing with a large volume of unstructured and structured data with low
latency. One appealing offering is Amazon EMR. This offering enables one
to run Big Data workloads such as Apache Spark, Hive, and Presto using
complementary Amazon services such as Amazon EC2 clusters (Amazon
EMR, n.d.).
Since we have not studied Presto in this class, I researched it further and
learned that Presto is an open-source distributed SQL query engine. Presto is a
community-owned project of the Presto Foundation (What is Presto? n.d.).
I found it interesting that one can use Presto to query data where it resides
(such as Cassandra, Hive, and traditional SQL databases), and Facebook also
uses Presto to run queries against its 300 PB (petabyte) data warehouse
(What is Presto? n.d.). From a glance at the Presto documentation, it looks
like one can use this tool as an alternative to Apache Pig and Hive (Use
Cases, n.d.). One benefit of using this tool rather than Apache Pig or Hive is
that one is not limited to just HDFS queries. Even though this tool originated
at Facebook, the documentation indicates that Facebook and the open-source
community continue developing the tool (Use Cases, n.d.).
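To make the idea of querying data where it resides more concrete, here is a minimal sketch. It is not Presto code. It uses Python's built-in sqlite3 module, and SQLite's ATTACH DATABASE stands in (as an analogy only) for the way a federated engine exposes several independent sources to one SQL statement. The database names (crm, sales), tables, and rows are hypothetical and exist only for illustration.

```python
import sqlite3


def federated_join_demo():
    """Join tables that live in two separate databases with one SQL query.

    Analogy only: SQLite's ATTACH stands in for a federated engine's
    connectors. All names and rows below are hypothetical.
    """
    conn = sqlite3.connect(":memory:")  # plays the role of the query engine
    # Two attached databases play the role of two independent data sources.
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS sales")

    conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 12.5)]
    )

    # One statement spans both sources; no data is copied between them first.
    rows = conn.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(federated_join_demo())
```

The point of the sketch is the shape of the query: each table is addressed by a source-qualified name, and the join happens in the engine rather than after an export/import step.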
There are other computing service offerings that one can leverage in Big
Data analytics projects, such as Amazon Athena (to query Amazon S3
storage), Amazon Redshift (data warehousing), and Amazon Kinesis (real-time
video and stream analysis). Amazon QuickSight (business analytics),
Amazon OpenSearch (text and unstructured data search), and AWS Glue
DataBrew (data cleaning and normalization) round out their data analytics
offering (Analytics on AWS, n.d.).
I also noticed that Amazon provides several tutorials on using its services,
which one could find helpful since a big part of Big Data analytics is getting
the data ingested into a data store for analysis. I liked that Amazon's
documentation appears to be thorough, mature, and full of examples where
one can take advantage of their service offerings. Amazon also provides use
cases and customer success stories to help explain how their service
offerings fit together to provide a complete data analytics stack. As an
example, Amazon explains how Pearson used Amazon OpenSearch to
analyze and gain insights from log data that it collects to improve the
company's learning platform (Pearson Boosts Security and Productivity
Using Amazon OpenSearch Service, 2020).
Not surprisingly, Google also has many service offerings that provide
compute capacity to handle a large volume of unstructured and structured
data with low latency. The centerpiece of their Big Data service offering
seems to be Google BigQuery, which Google claims has a 26-34%
lower Total Cost of Ownership over three years than cloud data warehouse
alternatives (BigQuery, n.d.). BigQuery also has various client libraries
available. A data analyst can leverage familiar programming languages such
as Python, Java, JavaScript, and Go to write and execute queries to gain
valuable insights from Big Data (What is BigQuery? n.d.).
One complementary offering from Google that sounds interesting is their
BigQuery Omni product, which enables data analysis across clouds such as
Microsoft Azure and Amazon Web Services. One tremendous benefit to
Microsoft and Amazon customers is that they do not have to move their data
to Google Cloud to take advantage of Google BigQuery (BigQuery Omni,
n.d.). As a former Microsoft Azure administrator, I can certainly appreciate
Google's interest in serving customers who operate resources and services in
one or more public clouds.
Last but not least, Microsoft has several service offerings to support
computing with a large volume of unstructured and structured data with low
latency. It seems like their service offerings focus on either Azure HDInsight
or Azure Databricks. Azure HDInsight essentially enables one to leverage a
customizable environment of Hadoop components distributed in Microsoft's
cloud (What is Azure HDInsight? 2021). I found this offering appealing
since one can scale a workload up and down and only pay for what one uses
(similar to IBM's Watson Studio cloud offering). Depending on business
need, one can also select from any number of cluster types to deploy in
Microsoft Azure. For example, one could create an Apache Spark cluster to
support a Big Data analytics project where in-memory processing is vital for
performance reasons. Alternatively, one could create an Apache Storm
cluster if large data streams need to be ingested and analyzed quickly.
I found it challenging to review some of these cloud-based service offerings
because it is hard to understand what Big Data technology is behind each
offering. It seems like some of the cloud vendors have repurposed open-source
tools for the cloud to make it easy for customers to consume capacity
on a pay-as-you-go basis, while others have built their own proprietary tools
to handle Big Data needs.
I think one needs to be naturally curious and determined to learn more to
understand the underpinnings of each one. One could spend hours reading
about the different features and experimenting with capabilities. For
example, I spent some time reading about Amazon Kinesis, which provides
some exciting real-time data analysis capabilities with other Amazon service
offerings. I learned that Zillow uses Kinesis Data Streams to collect public
data and Multiple Listing Service (MLS) listings, which helps the company
keep buyers and sellers updated on estimated home values (Amazon Kinesis,
n.d.). This capability sounds like an Apache Kafka or Flume feature, so I
spent some time reading the Kinesis Data Streams developer documentation
to see what one can do with the service.
It isn't easy to render an informed opinion on which one is best with all the
available tool choices. My favorite answer, as a former consultant, is "it
depends." I think one needs to carefully assess the business requirements for
any Big Data analytics project and then evaluate the ability of various Big
Data platforms, tools, and service providers to meet the requirements. All
kinds of questions come to mind: Does one need to analyze data in real time
or using batch processing? How much disk storage is required to store the
data now and in the future? Who needs to access the data, and what is the
data governance strategy to ensure the data is properly secured? Does the
data need to be encrypted at rest? How responsive do queries need to be?
What kind of reports will one need to generate and distribute? Is the data
structured, semi-structured, unstructured, or some combination of all three?
Will a data warehouse suffice to store the data, or is a more extensible data
lake with data originating from disparate data sources required? The list goes
on and on, but those are a few questions one may need to answer before
selecting Big Data tools, platforms, and a cloud service provider.
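One way to turn questions like these into a comparison is a simple weighted scoring matrix. The sketch below is only an illustration of that approach: the function name rank_platforms, the criteria, the weights, the platform names, and every score are hypothetical placeholders, not measurements of any real vendor.

```python
def rank_platforms(weights, scores):
    """Return (platform, weighted_score) pairs, best first.

    weights: criterion -> importance (any positive number)
    scores:  platform  -> {criterion -> rating, e.g. 1 (poor) to 5 (strong)}
    """
    total_weight = sum(weights.values())
    ranked = []
    for platform, per_criterion in scores.items():
        # Weighted average of this platform's ratings across all criteria.
        weighted = sum(weights[c] * per_criterion[c] for c in weights) / total_weight
        ranked.append((platform, round(weighted, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs: replace with the project's real requirements and ratings.
weights = {"real_time": 3, "storage_growth": 2, "encryption_at_rest": 3, "reporting": 1}
scores = {
    "platform_a": {"real_time": 5, "storage_growth": 3, "encryption_at_rest": 4, "reporting": 2},
    "platform_b": {"real_time": 2, "storage_growth": 5, "encryption_at_rest": 4, "reporting": 5},
}

if __name__ == "__main__":
    for platform, score in rank_platforms(weights, scores):
        print(platform, score)
```

The weights carry the "it depends" part of the answer: a project that must analyze data in real time would weight that criterion heavily, while a reporting-heavy project would not, and the ranking can flip accordingly.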
Lastly, my experience with product/tool/platform evaluations has been that
some vendors will practically fall over themselves to earn your business and
trust. For example, our agency was interested in determining how well one
vendor's machine learning platform would identify fraudulent transactions in
a dataset we had available. We invited the vendor to work with us side by
side in a sort of "hackathon" where we hacked through the business problem
and developed a machine learning model using the vendor's platform. That
exercise gave us a good understanding of the vendor's capabilities and a
baseline against which we could compare products from other vendors.
When it comes to vendors wanting to show off shiny new toys coming out of
their respective research labs, there is typically no shortage of enthusiasm
from the vendor's product managers.
I would be interested in hearing what tools others in our class have
discovered for computing with a large volume of unstructured and structured
data with low latency.
Very Respectfully,
Michael Goddard
References:
"Amazon EMR" (n.d.). Amazon.com. Retrieved from
https://aws.amazon.com/emr/?c=a&sec=srv
"Amazon Kinesis" (n.d.). Amazon.com. Retrieved from
https://aws.amazon.com/kinesis/?c=a&sec=srv
"BigQuery Omni" (n.d.). Google.com. Retrieved from
https://cloud.google.com/bigquery-omni/docs/introduction
"Pearson Boosts Security and Productivity Using Amazon OpenSearch
Service" (2020). Amazon.com. Retrieved from
https://aws.amazon.com/solutions/case-studies/pearson-elasticsearch-case-study/
"Use Cases" (n.d.). Presto.io. Retrieved from https://prestodb.io/
"What is Azure HDInsight?" (2021, November 18). Microsoft.com.
Retrieved from
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-overview
"What is BigQuery?" (n.d.). Google.com. Retrieved from
https://cloud.google.com/bigquery/docs/introduction
"What is MaxCompute?" (2022, January 28). Alibaba. Retrieved from
https://www.alibabacloud.com/help/en/doc-detail/27800.htm
"What is Presto?" (n.d.). Presto.io. Retrieved from https://prestodb.io/