Final paper on “Measures to Reduce Data Dump in Minimal Time”

Literature Review on “Measures to Reduce Data Dump in Minimal Time”

Student Name

Department, Institution

Course Title

Instructor Name

Due Date

Introduction

Data dumping plays an integral part in the operations of different entities and their systems. Services involved in production and in serving customers are heavily affected by data dumping. It is therefore important to reduce the time this process takes, for the benefit of all stakeholders. The research problem is to determine the most effective methods and practices for improving data dumping. The paper integrates a number of literature sources to inform these strategies. These are key aspects for discussion given the substantial reliance on big data sets. The main components of this paper include data processing and storage frameworks, data protection, and data preservation. Comparing and contrasting the strategies provides perspective on appropriate ways of addressing issues to do with data dumping. With the continued reliance on transferring large amounts of data from system to system over networks, it is imperative that parties employ suitable means of addressing data dumping.

Literature Review

The rapid production of working software, close collaboration with customers, and quick response to changes in customer requirements all require minimal data dumping time. The three research articles reviewed here present methodologies, findings, and recommendations for completing data dumping in the least time possible. The studies were conducted by different authors, and this literature review focuses only on comparing and contrasting those three sections. Therefore, the review analyses the three resources to identify the significant similarities and differences in each section.

Munappy et al. (2020) conducted research with a methodology similar to that of Qayyum (2020). Both studies conducted empirical reviews of previous literature concerning how organizations rely on big data. The existing literature was informative about how organizations put big data to use. Qayyum (2020) performed initial research about DataOps in various databases, including IEEE Xplore, ACM Digital Library, ScienceDirect, Google Scholar, Web of Science, and Scopus. The third article, published a little earlier by Machado et al. (2019), used the same method of research. Its authors used secondary sources to understand the integration of resources, the management of data overheads, performance degradation, and data backup. All three research articles therefore relied on a similar methodology of analyzing several existing resources.

Qayyum (2020) found that interviewees associated Hadoop with terms like automation, collaboration, and integration. Hadoop is described as a solution for an ever-growing population of young but exposed users. At the same time, Machado et al. (2019) found that DOD-ETL has features like horizontal scalability, high availability, and low latency. They further explained that Spark becomes faster when customized using DOD-ETL. Similar to Qayyum (2020) and Machado et al. (2019), Munappy et al. (2020) found that DataOps is beneficial because it reduces the end-to-end cycle time, producing more meaningful insights. Therefore, these articles each provide a solution for handling big data and reducing the time taken to complete data dumping.

Machado et al. (2019) recommended future analysis of DOD-ETL to determine certain variabilities. The original experiment was based on heavyweight frameworks, and future studies need to compare lightweight Stream Processing frameworks like Kafka Streams and Samza. In the same way, Munappy et al. (2020) state that future research should include the interviews that were excluded from the current study. Qayyum (2020) likewise proposed Hadoop as a solution to help enhance real-time streaming. The three sets of authors each recommend further analysis that relies on peer-reviewed resources rather than grey sources. Each also offers a solution to future data dumping problems.

The research by Munappy et al. (2020) differs from that of Qayyum (2020) because the two used similar but not identical methodologies. Although both performed secondary research on other sources, Munappy et al. (2020) went further and conducted semi-structured interviews. Those interviews comprised a total of 45 questions in six categories. Machado et al. (2019) conducted a similar online search for sources. In contrast with Qayyum (2020), who conducted only secondary research, Machado et al. (2019) also took a step further by conducting interviews intended to benefit the company. Therefore, the documents differ in the kinds of sources the authors used to find data and information.

DataOps is presented by Munappy et al. (2020) as a data strategy for analytics because it reduces the end-to-end cycle time. Qayyum (2020) shows that organizations can make great use of the opportunities of big data, with Hadoop as a solution for processing large data dumps. Machado et al. (2019), on the other hand, showed that DOD-ETL customizations have no negative impact on the fault tolerance and scalability of Spark Streaming. In other words, DOD-ETL techniques and strategies help to reduce the run time of ETL. According to the authors, this model outperforms a modern Stream Processing framework. The findings of the articles show that each researcher has a different way of measuring data dumping.

The articles agree that further research should include the factors that were left out of their current studies. In contrast, each of the articles gave dissimilar results by recommending a different technology, and so each provides a different solution for the issue of slow data dumping. While Machado et al. (2019) present the DOD-ETL solution, Munappy et al. (2020) and Qayyum (2020) present DataOps and the Hadoop Distributed File System (HDFS), respectively. The literature review thus covered three research articles with similarities and differences in their methodology, findings, and recommendations.

Discussion

Pogue (2019) presented critical insights regarding how people may “cleanse” data. When people need a checkup on their cars, finances, or health, they usually turn to mechanics, accountants, or doctors. However, their digital lives are usually left unattended, as many people go for years without looking after their data, particularly data stored digitally. Storage devices such as hard drives are bound to fail eventually. Even manufacturers are aware of that fact and indicate it in the manual’s fine print as “M.T.B.F.,” which stands for “mean time between failures.” Therefore, if an individual’s drive stores the only copies of photos and other files, all of that data is at risk of being lost. There is thus a need for a continuous and automatic backup system. According to Munappy et al. (2020), only 6 percent of people have set up a backup system for their digital files. It is important to consider offsite backups like the cloud, which is immune to such hazards as burglars, floods, and fire. Pogue (2019) clearly points to the issue of data dump, whereby people have not been prudent enough to back up data in safer locations that cannot be easily compromised.

Mitigation

Data dump can be reduced when appropriate measures are taken, including the protection and preservation of valuable data that has already been cleaned up. Data users need to secure their data through backups, accompanied by regular checks and tests. Data dump may be prevented through appropriate investment in backup solutions and the creation of a backup schedule. According to a study by Machado et al. (2019), 60 percent of people use a backup solution to cushion themselves from potential data loss, but the data is often still compromised because the backup solution is out of date or the backup itself is faulty. Apart from data security, data dump may be significantly lowered through proper data organization. This includes a neat, well-organized folder structure with logically named files, which helps in identifying important records and documents faster and more efficiently. Organized data is also less likely to be lost through accidental deletion of files. Data users may want to create a “dump” folder for storing all data whose value is uncertain. Reducing data dump can also be achieved through safe cleaning and protection of hardware. According to Qayyum (2020), dust is one of the factors that impair the cooling system of a computer, causing the device to overheat. Overheating may crash the operating system and, consequently, lead to data loss. Data users and device owners are advised to use cleaning products designed for electronic devices and to avoid eating or drinking near the computer.
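The advice above to pair every backup with regular checks can be illustrated with a short sketch. The example below is an assumption for illustration, not a method taken from the reviewed articles: it copies a folder to a backup location and compares SHA-256 checksums of each original and its copy, so that a faulty backup is detected at the time it is made. The function names `sha256_of` and `backup_and_verify` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, destination: Path) -> list[Path]:
    """Copy every file under source into destination and return the
    relative paths whose backup copy does not match the original.

    Hypothetical helper: destination is assumed to lie outside source.
    """
    failed = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source)
        copy = destination / relative
        copy.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(original, copy)  # copies contents and file metadata
        if sha256_of(original) != sha256_of(copy):
            failed.append(relative)
    return failed
```

An empty list returned by `backup_and_verify` means every copy matched its original. A scheduler such as cron or Windows Task Scheduler could run a script like this at fixed intervals to implement the backup schedule.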

In most cases, people do not have control over the type or format of data they import from an external source such as a web page, text file, or database. Therefore, it is recommended that the data be cleaned up before analysis. The cleaning process may use software such as Microsoft Excel and its spell checker, which help in formatting the data precisely and correcting misspelled words in columns that contain descriptions or comments. The Remove Duplicates dialog box helps in removing duplicate rows. Another recommendation is to create a backup copy of the original data in a different workbook. Data users should ensure that the data is in a tabular format of rows and columns, with no blank rows within the range, all rows and columns visible, and similar data in every column. This can be easily achieved through the use of an Excel table. In dealing with the problem of data dump, it is important to first carry out tasks that do not need column manipulation, such as checking spelling or using the Find and Replace dialog box. This step should be followed by tasks that do need the columns to be manipulated. Column manipulation includes inserting a new column next to the original column that needs to be cleaned, then adding a formula at the top of the new column that transforms the data. By the end of the process, the new data is “clean,” and the process takes very little time. The clean data should always be backed up, particularly in the cloud, to prevent data loss, which can have serious consequences for the data user. When encountering data loss, some people attempt recovery through their own at-home remedies, which may cause more harm than good. Therefore, it is recommended that an individual immediately contact a data recovery professional for assistance.
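The cleaning steps described above (keep the original data untouched, drop blank rows, tidy the text, and remove duplicate rows) can also be scripted. The following is a minimal sketch, assuming the imported table is available as a list of rows of strings. The function name `clean_rows` is hypothetical, and unlike Excel's Remove Duplicates, this sketch compares rows after whitespace has been trimmed.

```python
def clean_rows(rows: list[list[str]]) -> list[list[str]]:
    """Return a cleaned copy of rows; the input list is left untouched,
    which plays the role of the backup copy of the original data."""
    cleaned = []
    seen = set()
    for row in rows:
        # Trim each cell and collapse repeated whitespace inside it.
        trimmed = [" ".join(cell.split()) for cell in row]
        # Skip blank rows, i.e. rows in which every cell is empty.
        if not any(trimmed):
            continue
        # Keep only the first occurrence of each distinct row.
        key = tuple(trimmed)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned


raw = [
    ["name", "comment"],
    ["Ann ", " good  value"],
    ["", ""],
    ["Ann", "good value"],
]
print(clean_rows(raw))
# [['name', 'comment'], ['Ann', 'good value']]
```

Because `clean_rows` builds new lists instead of editing `raw` in place, the original import remains available for comparison or recovery.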

Conclusion

Research has indicated that various considerations are effective in influencing data dumping. They all positively affect data processing, transfer, storage, and preservation. The main factors relate to how these areas are used. The variety of options shows just how vital it is to find ways of dealing with data dumping. Different studies have emphasized different approaches, but evidently all of them are useful. Going forward, more attention is needed on integrating favorable practices in order to make these processes more efficient. Being proactive should make data processing and protection more effective.

References

Machado, G. V., Cunha, Í., Pereira, A., & Oliveira, L. B. (2019). DOD-ETL: Distributed on-demand ETL for near real-time business intelligence. Journal of Internet Services and Applications, 10(1), 1-15. https://doi.org/10.1186/s13174-019-0121-z

Munappy, A. R., Mattos, D. I., Bosch, J., Olsson, H. H., & Dakkak, A. (2020, June). From ad-hoc data analytics to DataOps. In Proceedings of the International Conference on Software and System Processes (pp. 165-174). https://doi.org/10.1145/3379177.3388909

Pogue, D. (2019). How to do a data “cleanse.” New York Times. https://www.nytimes.com/2019/02/01/smarter-living/how-to-do-a-data-cleanse.html

Qayyum, R. (2020). A roadmap towards big data opportunities, emerging issues and Hadoop as a solution. International Journal of Education and Management Engineering (IJEME), 10(4), 8-17. https://j.mecs-press.net/ijeme/ijeme-v10-n4/IJEME-V10-N4-2.pdf