Monday, May 28, 2012

Why HPCC is a superior alternative to Hadoop

Enterprise Ready

  • Batteries included: All components are included in a consistent and homogeneous platform: a single configuration tool, a complete management system, seamless integration with existing enterprise monitoring systems, and all the documentation needed to operate the environment. HPCC has been designed with operability in mind and includes all the instrumentation required to professionally manage and monitor the entire system. Because no additional third-party components are required, implementation is simpler and the complexities that arise from heterogeneous platforms such as Hadoop are eliminated.
  • Backed by over 10 years of experience: The HPCC platform is the technology underpinning LexisNexis data offerings, serving multi-billion-dollar, business-critical 24/7 environments with the strictest SLAs. In addition to this internal experience, over the years we have also sold several HPCC turn-key systems to traditional large enterprises, law enforcement and the intelligence community. While the HPCC platform itself is only 12 years old, LexisNexis has over 35 years of experience in the data and information services business, data analytics and scoring, full-text storage and retrieval, entity extraction and resolution, and data linking, all of which the HPCC platform has leveraged.
  • Fewer moving parts: Unlike Hadoop, HPCC is an integrated solution extending across the entire data lifecycle, from data ingestion and data processing to data delivery. No third-party tools are needed. This consistency reduces operating costs, increases overall availability and reliability, simplifies capacity planning and provides a single view of the overall data workflow.
  • Multiple data types: The HPCC platform supports multiple data types out of the box, including fixed and variable length delimited records and XML. The data analyst is free to define the data model based on business needs, without the constraints imposed by, for example, the strict key-value store models offered by Hadoop and NoSQL systems. Unlike Hadoop HDFS, HPCC prevents records from being split across nodes, which has a positive impact on performance. And since DFS (the distributed filesystem used by HPCC) is record oriented rather than block oriented like HDFS, it makes more efficient use of space and tolerates data skews better.
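
The difference between block-oriented and record-oriented distribution can be illustrated with a toy sketch. This is not how HDFS or the HPCC DFS is implemented; the function names, the sample records and the 16-byte block size are all assumptions chosen to make the effect visible.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Block-oriented: cut the byte stream at fixed offsets, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def distribute_records(records: list[bytes], node_count: int) -> list[list[bytes]]:
    """Record-oriented: hand out whole records to nodes, round-robin (hypothetical layout)."""
    return [records[i::node_count] for i in range(node_count)]


# Four newline-terminated records of different lengths (made-up sample data).
records = [b"alice,1970\n", b"bob,1985\n", b"carol,1992\n", b"dave,2001\n"]
stream = b"".join(records)

blocks = split_into_blocks(stream, block_size=16)
parts = distribute_records(records, node_count=2)

# A block that does not end on a newline has cut a record in two.
split_blocks = sum(1 for block in blocks[:-1] if not block.endswith(b"\n"))
# Every record placed by the record-oriented layout is still complete.
whole_records = all(r.endswith(b"\n") for part in parts for r in part)

print(f"{split_blocks} of {len(blocks)} blocks end in the middle of a record")
print(f"all distributed records intact: {whole_records}")
```

In the block-oriented case, whoever reads a block has to fetch the rest of a cut record from a neighbouring block; in the record-oriented case each node holds only complete records.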

Beyond MapReduce

  • Open Data Model: Unlike Hadoop, the data model is defined by the user and is not constrained by the limitations of a strict key-value paradigm. As a consequence, the HPCC platform adapts more naturally to expressing data problems in terms of computational algorithms, which reduces complexity and considerably shortens software development times.
  • Simple: Unlike Hadoop MapReduce, solutions to complex data problems can be expressed easily and directly in terms of high-level ECL primitives. With Hadoop, creating MapReduce solutions to all but the simplest data problems can be a daunting task. Most of these complexities are eliminated by the flexible HPCC programming model, because programmers are not restricted to thinking in terms of Map, Shuffle and Reduce operations when designing their data algorithms.
  • Truly parallel: Unlike Hadoop, nodes of a datagraph can be processed in parallel as data seamlessly flows through them. In Hadoop MapReduce (Java, Pig, Hive, Cascading, etc.) almost every complex data transformation requires a series of MapReduce cycles; no phase of these cycles can start until the previous phase has completed for every record, which contributes to the well-known "long tail problem" in Hadoop. HPCC effectively avoids this, which results in higher and more predictable performance and a more natural flow of data across the execution graph.
  • Powerful optimizer: The HPCC optimizer is designed to execute submitted ECL code at the maximum speed the underlying hardware allows. Advanced techniques such as lazy execution and code reordering are used throughout: the ECL optimizer re-orders operations to make the most of the specific hardware architecture and to increase parallelization, shortening overall execution times.
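
The contrast between staged execution and data flowing through a graph can be sketched with Python generators. This is a single-threaded illustration of the ordering only, not of real parallelism and not of how Thor or Hadoop schedule work; the stage names `parse` and `double` and the `events` log are assumptions made for the example.

```python
from typing import Iterable, Iterator

events: list[str] = []  # records the order in which work actually happens


def parse(lines: Iterable[str]) -> Iterator[int]:
    """First stage: turn each text line into an integer."""
    for line in lines:
        events.append(f"parse {line}")
        yield int(line)


def double(values: Iterable[int]) -> Iterator[int]:
    """Second stage: double each value it receives."""
    for value in values:
        events.append(f"double {value}")
        yield value * 2


def run_staged(lines: list[str]) -> list[int]:
    """Each stage finishes for every record before the next stage begins."""
    parsed = list(parse(lines))  # barrier: all records are parsed first
    return list(double(parsed))


def run_pipelined(lines: list[str]) -> list[int]:
    """Each record flows through both stages before the next record is read."""
    return list(double(parse(lines)))


staged_result = run_staged(["1", "2"])
staged_events = events.copy()

events.clear()
pipelined_result = run_pipelined(["1", "2"])
pipelined_events = events.copy()

print("staged:   ", staged_events)
print("pipelined:", pipelined_events)
```

Both runs produce the same result, but in the staged run the second stage sits idle until the slowest record has cleared the first stage, which is the shape of the long tail problem described above. The generators are also lazy: no work happens until a result is actually requested.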

Roxie Delivery Engine

  • Low latency: Even complex Roxie queries typically complete in fractions of a second. Latencies are measurable and predictable across the environment, which simplifies the creation of Service Level Agreements. An outstanding property of Roxie is that latencies are maintained across a large range of loads. Built-in redundancy also ensures that latencies are not degraded when a single node slows down, as would happen, for example, when a hard drive fails.
  • Not a key-value store: Unlike HBase, Cassandra and others, Roxie is not limited by the constraints of key-value data stores, allowing for complex queries, multi-key retrieval, fuzzy matching and more. Advanced capabilities provide for dynamic indices and other state-of-the-art functionality. Roxie queries (the equivalent of stored procedures in a traditional RDBMS) are exposed through SOAP, RESTful and JSON interfaces to ease integration with other systems.
  • Highly available: Roxie is designed to operate in critical environments, under the most rigorous service level requirements. The degree of redundancy is user defined, with a default setting of 2x, which can satisfy very stringent business operating standards. One key benefit of Roxie is that the redundant nodes are part of the pool used to serve queries under normal conditions. This provides higher concurrency when needed, for example, to absorb transient load peaks that would not justify deploying extra systems. The absence of single points of failure (SPOF) is paramount to ensuring uninterrupted operation when any component in the system fails.
  • Scalable: Linear horizontal scalability provides room to accommodate future data and performance growth. As data or load increases, additional nodes can be provisioned to add space for more data or computing power to support higher concurrent load requirements. Since the architecture is truly distributed, there is no single component that can become a bottleneck for overall system performance.
  • Highly concurrent: In a typical environment, thousands of concurrent clients can be simultaneously executing transactions on the same Roxie system. Thanks to the distributed nature of the Roxie architectural components, load is spread across the entire system, ensuring that no node or group of nodes is subject to load levels that could degrade the overall performance.
  • Redundant: A shared-nothing architecture with no single points of failure (SPOF) provides extreme fault tolerance. Default redundancy is 2x (twice as many nodes as needed, so that two copies of the data are stored), but this is user definable, and higher levels of redundancy can be configured to tolerate even higher degrees of failure across the system, which could be required, for example, when operating in datacenter environments that cannot be easily serviced.
  • ECL inside: One language describes both the data transformations in Thor and the data delivery strategies in Roxie.
  • Consistent tools: Thor and Roxie share the exact same set of tools, which provides consistency across the platform.
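
The effect of the redundancy setting on fault tolerance can be shown with a small sketch. The round-robin placement below is a hypothetical layout chosen for simplicity, not Roxie's actual placement algorithm, and `place_parts` and `survives` are illustrative names.

```python
def place_parts(part_count: int, node_count: int, redundancy: int = 2) -> dict[int, list[int]]:
    """Map each data part to `redundancy` distinct nodes (hypothetical round-robin layout)."""
    if redundancy > node_count:
        raise ValueError("redundancy cannot exceed the number of nodes")
    return {
        part: [(part + copy) % node_count for copy in range(redundancy)]
        for part in range(part_count)
    }


def survives(placement: dict[int, list[int]], failed_nodes: set[int]) -> bool:
    """True if every part still has at least one copy on a healthy node."""
    return all(
        any(node not in failed_nodes for node in nodes)
        for nodes in placement.values()
    )


nodes = range(4)

# Default 2x redundancy: every part lives on two different nodes.
two_copies = place_parts(part_count=8, node_count=4, redundancy=2)
single_failures_ok = all(survives(two_copies, {node}) for node in nodes)
double_failure_ok = survives(two_copies, {0, 1})

# Raising redundancy to 3x tolerates any two simultaneous node failures here.
three_copies = place_parts(part_count=8, node_count=4, redundancy=3)
all_double_failures_ok = all(
    survives(three_copies, {a, b}) for a in nodes for b in nodes if a < b
)

print("2x survives any single failure:", single_failures_ok)
print("2x survives nodes 0 and 1 failing together:", double_failure_ok)
print("3x survives any double failure:", all_double_failures_ok)
```

Because every node in this layout holds live copies of data, all of them can serve queries under normal conditions, which matches the point above about redundant nodes adding concurrency rather than sitting idle.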

Enterprise Control Language (ECL)

  • Declarative programming language: Declarative programming languages present numerous advantages over the more conventional imperative programming model. They allow the programmer to express the logic of a computation without describing its flow control, hence the expression: "tell the system what it should do, rather than how to go about doing it". In addition to simplifying the design and implementation of complex algorithms, this also improves the quality of programs by, for example, minimizing or eliminating side effects, which has a positive impact on code testing and maintainability. Programs are easier to understand, verify and extend, even by people who are not familiar with the original design, which also helps flatten the learning curve for new programmers.
  • Powerful: ECL has been designed specifically as a data-oriented programming language from the ground up. Unlike Java, high-level data primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, NORMALIZE, etc. are first-class functions. Basic data operations can be achieved in a single line of code. This makes ECL ideal for data analysts, who can use it to express data algorithms directly, avoiding long development cycles spent writing specifications for software developers to turn into programs. In essence, higher-level code means fewer programmers and shorter time to deliver complete projects.
  • Extensible: In ECL, as new attributes are defined, they become primitives that other programmers can benefit from. Thanks to code and data encapsulation (which data-oriented languages like Pig and Hive don't have), programmers can reuse existing ECL attributes without being concerned with their internal implementation. This provides for code that is easier to understand and more compact, and it simplifies future development.
  • Implicitly parallel: Parallelism is built into the underlying platform, and ECL provides a layer of abstraction over the architecture of the platform. The exact same ECL program will run on a single-node system or on a 1,000-node cluster. The programmer need not be concerned with parallelization, and the optimizer ensures the best performance for the specific hardware platform.
  • Maintainable: A high-level programming language, no side effects and attribute encapsulation provide for code that is more succinct, more reliable and easier to troubleshoot. ECL programs composed of dozens of lines typically express algorithms which would require thousands of lines of Java in the Hadoop MapReduce world. Moreover, imperative programming languages like Java, which allow and sometimes even encourage side effects, make testing difficult, because knowledge about the context and its possible histories is mandatory for a complete analysis of the software's behavior; as a consequence, these programs are hard to read, understand and debug. ECL programs are easily readable, and the lack of side effects makes them easily verifiable.
  • Complete: ECL programs are complete and can express algorithms that span the entire data workflow. Unlike Pig and Hive, which were originally designed for writing small code snippets, ECL provides a mature programming paradigm, encouraging collaboration, code reuse, encapsulation, extensibility and readability.
  • Homogeneous: ECL is the one language to express data algorithms across the entire HPCC platform. In Thor, ECL expresses data workflows consisting of data loading, transformation, linking, indexing, etc.; in Roxie, it defines data queries (the HPCC equivalent of stored procedures in a traditional RDBMS). Data analysts and programmers need to learn only one language to define the complete data lifecycle.
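
Why side-effect-free logic can run unchanged on one node or many can be sketched in a few lines. The example is written in Python rather than ECL and simulates the nodes sequentially; `partition`, `count_by_key`, `run` and the sample data are assumptions made purely for illustration.

```python
from collections import Counter


def partition(records: list[str], node_count: int) -> list[list[str]]:
    """Spread records across a hypothetical number of nodes, round-robin."""
    return [records[i::node_count] for i in range(node_count)]


def count_by_key(records: list[str]) -> Counter:
    """Pure per-partition work: no shared state and no side effects."""
    return Counter(record.split(",")[0] for record in records)


def run(records: list[str], node_count: int) -> Counter:
    """The same logic for any cluster size: local counts, then a merge."""
    total: Counter = Counter()
    for part in partition(records, node_count):
        total += count_by_key(part)
    return total


data = ["ny,1", "ca,2", "ny,3", "tx,4", "ca,5", "ny,6"]
one_node = run(data, node_count=1)
many_nodes = run(data, node_count=4)

print("same result on 1 and 4 nodes:", one_node == many_nodes)
```

Because `count_by_key` depends only on its input, each partition can be verified on its own and the caller never has to change the logic when the node count changes. In ECL this partitioning is handled by the platform rather than written by hand.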
Download the HPCC Platform
