TY - GEN
T1 - Fast deduplication data transmission scheme on a big data real-time platform
AU - Cheng, Sheng Tzong
AU - Chen, Jian Ting
AU - Chen, Yin Chun
N1 - Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2017
Y1 - 2017
N2 - In the information era, it is difficult to process and compute on large volumes of data efficiently. MapReduce is no longer adequate for handling more data in less time, let alone in real time. In-memory computing (IMC) was therefore introduced to address this limitation of Hadoop MapReduce. As its name suggests, IMC performs computation in memory to avoid the cost of Hadoop's excessive disk access, and it can be distributed to perform iterative operations. However, distributed IMC still faces a bottleneck: network bandwidth, which limits the speed of receiving data from the source and of distributing it to each node. Observation suggests that some data from sensor devices may be duplicated because of temporal or spatial dependence, so deduplication is a promising solution. Eliminating duplicated data can improve data utilization. This study presents an optimization of Spark Streaming, a distributed real-time IMC platform. It uses deduplication to eliminate possible duplicate blocks at the source, and it is expected to reduce redundant data transmission and improve the throughput of Spark Streaming.
AB - In the information era, it is difficult to process and compute on large volumes of data efficiently. MapReduce is no longer adequate for handling more data in less time, let alone in real time. In-memory computing (IMC) was therefore introduced to address this limitation of Hadoop MapReduce. As its name suggests, IMC performs computation in memory to avoid the cost of Hadoop's excessive disk access, and it can be distributed to perform iterative operations. However, distributed IMC still faces a bottleneck: network bandwidth, which limits the speed of receiving data from the source and of distributing it to each node. Observation suggests that some data from sensor devices may be duplicated because of temporal or spatial dependence, so deduplication is a promising solution. Eliminating duplicated data can improve data utilization. This study presents an optimization of Spark Streaming, a distributed real-time IMC platform. It uses deduplication to eliminate possible duplicate blocks at the source, and it is expected to reduce redundant data transmission and improve the throughput of Spark Streaming.
UR - http://www.scopus.com/inward/record.url?scp=85032964984&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032964984&partnerID=8YFLogxK
U2 - 10.5220/0006528401550164
DO - 10.5220/0006528401550164
M3 - Conference contribution
AN - SCOPUS:85032964984
T3 - BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design
SP - 155
EP - 166
BT - BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design
A2 - Shishkov, Boris
PB - SciTePress
T2 - 7th International Symposium on Business Modeling and Software Design, BMSD 2017
Y2 - 3 July 2017 through 5 July 2017
ER -