Integration consists of loading all the data into the big data storage. Indeed, in this phase the big data system collects massive data of any structure, from heterogeneous sources, with a variety of tools. The data are then stored in the HDFS file format or in a NoSQL database (Prabhu et al., 2019). In what follows, we will make a comparative study of the tools that carry out this collection operation with respect to the norms and standards of big data. We will then look at the different formats for storing structured and unstructured data.

The big data collection phase can be divided into two main categories, which depend on the type of load: either batch/microbatch or streaming. In the big data context, data integration has been extended to unstructured data (sensor data, web logs, social networks, documents). Hadoop uses scripting via MapReduce; Sqoop and Flume also participate in the integration of unstructured data. Certain integration tools that include a big data adapter already exist on the market; this is the case with Talend Enterprise Data Integration–Big Data Edition. To integrate large volumes of data from the building blocks of companies' information systems, ETL tools (extract, transform, load), enterprise application integration (EAI) systems, and enterprise information integration (EII) systems are still widely used (Acharjya and Ahmed, 2016; Daniel, 2017).

The components in the loading and collection process are:

1. Identification of the various known data formats; by default, big data targets unstructured data.
2. Filtration and selection of the incoming information that is relevant to the business.
3. Constant validation and analysis of the data.
4. Noise reduction, or removal, which involves cleaning the data.
5. Transformation, which can lead to the division, convergence, normalization, or synthesis of the data.
6. Compression, which consists of reducing the size of the data without losing its relevance.

1. Batch processing: the big data framework has three modes of data collection. The first mode concerns the collection of massive data done locally and then integrated successively into the storage system. The second mode is based on ETL techniques (extract, transform, load). The third is the Sqoop mode, which covers the import/export of data between a relational database and big data storage, either in Hadoop's HDFS file format or in NoSQL tables. This method responds effectively to the process of extracting and importing big data; to this end, the system creates a network of nodes for the synchronization of big data, and the transformation is carried out by MapReduce algorithms.

2. Stream processing: stream loading tools are growing every day, with the appearance of new APIs (application programming interfaces). This operation can be done in microbatch and in two modes: either the system is hot or it is stopped. We can therefore group these systems into two categories: real-time (hot) extraction, which works even while the system is in full production, and batch or microbatch extraction, performed at a well-determined time on small amounts of data, which requires shutting down the production system.

All of the above features are implemented by the tools of the big data framework (Acharjya and Ahmed, 2016):

1. Apache Flume interacts with log data by collecting, aggregating, and moving large amounts of data with great robustness and fault tolerance.
2. Apache Chukwa is based on the collection of large-scale system logs. It handles storage by exploiting NoSQL or HDFS storage.
3. Apache Spark is a framework whose basic principle is the treatment of all types of big data. Spark integrates stream processing through the Spark Streaming API, which allows microbatch processing and the storage of massive data in any big data storage system with RDD (resilient distributed dataset) support.
4. Apache Kafka is a data collection platform which offers three APIs for batch and stream loading: the Streams API makes it possible to collect massive streaming data and store it in topics; the Producer API publishes streaming data to one or more topics; and the Connector API connects applications to Kafka.

After citing the main tools of the big data framework that provide the different data collection modes of heterogeneous platforms, we will detail the data processing phase in the next section.