Fast deduplication data transmission scheme on a big data real-time platform

Sheng-Tzong Cheng, Jian Ting Chen, Yin Chun Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In the information era, it is difficult to process and compute such large volumes of data efficiently. MapReduce alone is no longer adequate for handling ever more data in less time, let alone in real time. Hence, In-Memory Computing (IMC) was introduced to address the shortcomings of Hadoop MapReduce. IMC, as the name suggests, performs computation in memory to avoid the cost of Hadoop's excessive disk access, and it can be distributed to perform iterative operations. However, distributed IMC still cannot escape one bottleneck: network bandwidth, which limits how quickly data can be received from the source and dispersed to each node. Observation shows that data from sensor devices are often duplicated because of temporal or spatial dependence, so deduplication technology is a promising solution: eliminating duplicated data improves data utilization. This study presents an optimization of the distributed real-time IMC platform "Spark Streaming" that uses deduplication technology to eliminate potentially duplicate blocks at the source. It is expected to reduce redundant data transmission and improve the throughput of Spark Streaming.
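The record contains no code, but the core idea the abstract describes (detecting repeated blocks at the data source and sending only a short reference for anything already transmitted) can be sketched independently of Spark. The block size, hashing scheme, and message format below are illustrative assumptions, not the authors' actual design.

# Minimal sketch of source-side block deduplication (illustrative only; the
# block size, hashing scheme, and message format are assumptions, not the
# authors' implementation).
import hashlib
from typing import Iterator, Set, Tuple, Union

BLOCK_SIZE = 4096  # assumed fixed block size

def dedup_blocks(payload: bytes, seen: Set[str]) -> Iterator[Tuple[str, Union[bytes, str]]]:
    """Split a payload into fixed-size blocks and yield ("data", block) for
    blocks not seen before, or ("ref", digest) for duplicates, so that only
    unique blocks need to cross the network."""
    for i in range(0, len(payload), BLOCK_SIZE):
        block = payload[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest in seen:
            yield ("ref", digest)   # duplicate block: transmit a short reference
        else:
            seen.add(digest)
            yield ("data", block)   # new block: transmit the data itself

if __name__ == "__main__":
    # Sensor readings tend to repeat over time, so consecutive blocks collide.
    seen_hashes: Set[str] = set()
    payload = b"sensor-000:temp=25.0,humid=60.0\n" * 256   # 8192 bytes, 2 identical blocks
    messages = list(dedup_blocks(payload, seen_hashes))
    refs = sum(1 for kind, _ in messages if kind == "ref")
    print(f"{len(messages)} blocks, {refs} replaced by references")

On the receiving side one would keep a matching digest-to-block cache and expand each reference back into the original block before handing the stream to Spark Streaming; how the paper integrates this with Spark's receiver is not described in this record.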

Original language: English
Title of host publication: BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design
Editors: Boris Shishkov
Publisher: SciTePress
Pages: 155-166
Number of pages: 12
ISBN (Electronic): 9789897582387
Publication status: Published - 2017 Jan 1
Event: 7th International Symposium on Business Modeling and Software Design, BMSD 2017 - Barcelona, Spain
Duration: 2017 Jul 3 - 2017 Jul 5

Publication series

Name: BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design

Other

Other: 7th International Symposium on Business Modeling and Software Design, BMSD 2017
Country: Spain
City: Barcelona
Period: 17-07-03 - 17-07-05

All Science Journal Classification (ASJC) codes

  • Modelling and Simulation
  • Software

Cite this

Cheng, S-T., Chen, J. T., & Chen, Y. C. (2017). Fast deduplication data transmission scheme on a big data real-time platform. In B. Shishkov (Ed.), BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design (pp. 155-166). (BMSD 2017 - Proceedings of the 7th International Symposium on Business Modeling and Software Design). SciTePress.