This manifesto has my full endorsement.
This is a 6-hour course on Spark, available as one YouTube video. Many things in this video are still relevant in 2022.
- Spark and Mesos were started in AMPLab in UC Berkeley
- Spark was originally a sample app for Mesos, to "spark" an ecosystem of apps for Mesos.
- Spark is an engine for scheduling, monitoring and distributing Big Data
- Hive queries work out of the box in Spark SQL because it uses the same frontend and Metastore
- Spark Streaming has pretty much replaced Apache Storm. Spark in general has replaced most of the Hadoop ecosystem, which consisted of too many tools.
- All the higher-level components push their compute down to Spark Core
- Streaming, SQL, MLlib, GraphX, BlinkDB, Tachyon
- BlinkDB can run time-bounded queries over large amounts of data, and also returns a confidence level for the result.
- Tachyon is a memory-based distributed storage system
- Spark's Resource Managers
- YARN
- local-mode
- Mesos (from the Greek for "middle", i.e. an intermediary)
- standalone (DataStax Cassandra)
- Resource managers can be made highly-available by storing their configuration in Zookeeper
- Storage: almost any filesystem or NoSQL database; anything that has a Hadoop InputFormat works.
- Fun fact: Tableau also uses Spark as its engine
- Hadoop MapReduce constantly writes intermediate results to HDFS, whereas Spark keeps them in memory. Hence, Spark is usually at least 10x faster.
- Recommended reading
- Original Spark white paper and another white paper on RDDs
- "Learning Spark" from O'Reilly is the best book
- Spark packages can be found at https://spark-packages.org
- Integrations with other tools, and libraries for specific tasks.
- Spark Shell always runs on the Driver
- In RDDs, the more partitions you have the more parallelism you have. Each partition requires a task, essentially a JVM thread.
- RDDs might even store images and videos.
- The `parallelize` method on the SparkContext object can be used to create an RDD. Reading from a file can also create RDDs, but the partitions it creates might not be optimal; you may need to `repartition`.
- A transformation like a `filter` could produce empty partitions. `coalesce` can be used to reduce the number of partitions in an RDD.
- A `.collect` call pushes the data back to the Driver JVM from the workers. Ensure that the Driver has enough memory to do this. It's better to just push the data back to HDFS or Cassandra, or collect only a sample of the larger RDD.
- All of the transformations in Spark are lazy. Spark builds a DAG of transformations, but doesn't actually execute them. Only when an action is called does the DAG get executed, aka materialized.
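A minimal PySpark sketch of these ideas; the numbers and the `getOrCreate()` setup are illustrative assumptions, not from the course:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# parallelize() creates an RDD; numSlices sets the number of partitions,
# and each partition becomes one task (a JVM thread on an Executor).
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# Transformations are lazy: these lines only extend the DAG.
evens = rdd.filter(lambda x: x % 2 == 0)
fewer = evens.coalesce(4)  # merge away possibly-empty partitions

# Actions materialize the DAG. take() pulls only a few elements to the
# Driver; prefer it (or saving to HDFS/Cassandra) over a full collect().
print(fewer.take(5))
```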
- Caching RDDs
- An RDD has to be materialized before it can be cached.
- If not cached, all the intermediate RDDs in a chain of transformations are discarded after the action completes. So, what to cache? Whatever RDD is used multiple times. Don't cache the base RDD; maybe cache the cleaned RDD.
- If an RDD doesn't fully fit into memory, Spark will store the rest on disk. Also applies to cached RDDs. Accessing disk would slow down Spark.
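A sketch of that caching advice, assuming an existing SparkContext `sc` and a hypothetical log file path:

```python
# Don't cache the base RDD; cache the cleaned RDD that is reused.
base = sc.textFile("hdfs:///logs/app.log")  # hypothetical path
cleaned = base.filter(lambda line: "DEBUG" not in line).cache()

cleaned.count()  # first action materializes the RDD and populates the cache
cleaned.count()  # later actions read from the cache instead of recomputing
```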
- Resource Managers
- Standalone, YARN and Mesos support dynamic partitioning (the number of executors can change at runtime); local mode has only static partitioning.
- Parallelism: Hadoop MapReduce used JVM processes for parallelism, whereas Spark uses JVM threads inside its Executor JVMs. Hadoop MR had dedicated slots for Map and Reduce operations, which was inefficient in resource utilization.
- Over-subscription: set the number of tasks to roughly 2-3x the number of cores on your machine. Setting them equal to the number of cores isn't necessary; let the Executor JVM handle its threads. A sketch of sizing partitions this way follows.
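A small sketch of the over-subscription idea, assuming `sc` exists and using a made-up file path:

```python
# Aim for roughly 2-3 tasks per core rather than exactly one per core.
target = sc.defaultParallelism * 3
events = sc.textFile("hdfs:///data/events", minPartitions=target)  # hypothetical path
print(events.getNumPartitions())
```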
- `spark-submit` is a command to submit your own Scala programs to the Spark Driver, which submits the work to the Spark Master.
- Standalone mode (with DataStax Cassandra)
- Master and Worker use little resources compared to the Executors. Masters can be HA and even be added to a live cluster.
- 1 Worker can only run 1 Executor for each application. To have more than one Executor on each machine for a single application, a corresponding number of Workers should be started as well.
- Mounted disks are listed in the `SPARK_LOCAL_DIRS` environment variable. RDDs can be stored (fully or partially) on these disks.
- It's possible to have a heterogeneous composition of machines. Spark learns the number of cores on each machine from the `SPARK_WORKER_CORES` variable in spark-env.sh. `SPARK_WORKER_MEMORY` is the amount of memory that a Spark Worker can give to its Executors. Similarly, a Master's memory is the total amount of memory that it can allocate to its Workers.
- YARN: Yet Another Resource Negotiator
- Spark can run in either client-mode or cluster-mode on YARN. In client-mode, the Driver runs on a client machine. In cluster-mode, the application and the Driver are submitted to the cluster at the same time, with no dependency on the client machine.
- On YARN, Spark can also dynamically increase and decrease the number of executors.
- Persistence
- Persisting in memory with serialization can be more space efficient, but costs CPU. Use a very efficient serialization format.
- Caching of RDDs works like an LRU cache. If there isn't enough memory, the oldest RDDs will be evicted first.
- With memory + disk storage, the oldest partitions of an RDD will be moved to disk in a serialized format. By default the in-memory parts are not serialized.
- `MEMORY_ONLY_2` stores the RDD on two different JVMs. Only to be used for extremely expensive RDDs.
- Tachyon (since renamed Alluxio) can be used for `OFF_HEAP` storage. It can be used for sharing RDDs between different Spark applications, even ones written in different languages.
- Shuffle operations automatically persist intermediate RDDs.
- Local Dirs will be cleaned up in an LRU fashion.
- PySpark objects are always serialized using the Pickle library. Python object representations are not used even when persisted in memory.
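A short sketch of choosing a storage level in PySpark; the RDD itself is a toy example:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
squares = sc.parallelize(range(10_000)).map(lambda x: x * x)

# MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
# MEMORY_ONLY_2 would replicate each partition to a second JVM; reserve it
# for RDDs that are extremely expensive to recompute.
squares.persist(StorageLevel.MEMORY_AND_DISK)
squares.count()  # the action materializes and persists the RDD
```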
- Lineage of RDDs
- RDDs can be produced by narrow transformations or wide transformations.
- In narrow transformations, one partition in the parent is used by only one partition in the child. Could be one-to-one or many-to-one. Can be executed in parallel.
- In wide transformations, one partition in the child could depend on multiple partitions in the parent. Needs shuffling. Costly to recreate.
- Pipelining is an internal optimization in Spark that lets one thread perform multiple transformations on a partition.
- `toDebugString` prints out the lineage of an RDD, with stage boundaries represented using indentation (see the sketch below).
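A sketch showing a narrow and a wide transformation in one lineage; the data is made up:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

mapped = pairs.mapValues(lambda v: v * 10)       # narrow: no shuffle
summed = mapped.reduceByKey(lambda a, b: a + b)  # wide: stage boundary

# In PySpark toDebugString() returns bytes; the indentation in the output
# marks the shuffle (stage) boundaries.
print(summed.toDebugString().decode())
```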
- Shuffling
- `repartition`, `join`, `cogroup`, and the `*By`/`*ByKey` operations all cause shuffles.
- The `numPartitions` parameter might cause a shuffle.
- `coalesce` never causes a shuffle (by default it only merges partitions).
- Broadcast variables can avoid shuffles. A broadcast variable is a small amount of data, usually a read-only lookup table, that is broadcast from the Driver to all the Executors (see the sketch after this list).
- Broadcast uses the BitTorrent protocol.
- Accumulators are like counters in Hadoop. They count events that occur during job execution. Good for debugging. They only support associative operations. Only the Driver program can read an accumulator's value; tasks can't.
- PySpark is layered as (PySpark (Java API (Spark Core Engine))). It can use CPython or PyPy, for both Driver and Worker machines. PyPy is better when not using C libraries.
- Netty Native Buffer bypasses the kernel buffers and JVM buffers to get data directly from disk to the network, avoiding two copies.
- Spark shifted from hash-based shuffle to sort-based shuffle.
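A sketch combining a broadcast lookup table with an accumulator; the country-code table and counter are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical read-only lookup table, shipped once per Executor instead of
# once per task (and avoiding a shuffle-heavy join against a tiny dataset).
countries = sc.broadcast({"IN": "India", "JP": "Japan"})

unknown = sc.accumulator(0)  # counts events during the job, like a Hadoop counter

def to_name(code):
    name = countries.value.get(code)
    if name is None:
        unknown.add(1)  # associative update; tasks can add but never read
    return name

codes = sc.parallelize(["IN", "JP", "XX", "IN"])
print(codes.map(to_name).collect())
print("unknown codes:", unknown.value)  # readable only on the Driver
```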
- Spark Streaming can read data at high rates from various sources, do transformations and then write to various storage options.
- Processing time must be lower than the batch interval, otherwise the stream will fall behind.
- Spark Streaming uses micro-batches, whereas Storm processes each event as it arrives. Storm Trident is like Spark Streaming. Storm uses at-least-once processing, so an event could be processed more than once. The micro-batch streams in Spark Streaming are called DStreams.
- Batch interval is user configurable.
- Can do sliding window operations on DStreams.
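A sketch of a sliding-window word count over a DStream; the socket source, port, and durations are assumptions (it could be fed by `nc -lk 9999`):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, batchDuration=2)  # 2-second batch interval

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

# Count words over a 30s window that slides every 10s; both durations must
# be multiples of the batch interval.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```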
I have been using Nushell as my primary shell for about 6 months.
I don't miss Bash and Zsh yet, though I have to switch to bash once in a while to paste commands I find online.
Defining custom functions in Nushell is a pleasure since it uses a Ruby-like language.
Nushell also has several built-in commands that obviate the need for the weirder Unix utilities. I especially don't miss GNU Parallel, since Nushell's language supports easy parallelization.
You might appreciate Nushell if you are a data engineer/scientist since it provides the ability to use dataframes using Polars.
Yes, in a way, Nushell is a data pipeline tool pretending to be a shell.
I have no intention of ever learning PowerShell, but might end up working on a Windows machine at some point in my career. Nushell has me covered here. It works on Windows too!
Data Pipelines have two orthogonal planes: Code Plane and Data Plane
A crash course on Data Engineering made by data engineers from Thoughtworks.
This is a two-part series by Benjamin Rogojan aka the Seattle Data Guy
"...some software engineers move into data engineering roles assuming they will be working on distributed systems or developing infrastructure for data teams. This isn’t always the case.
These roles are generally denoted as 'software engineering, data or data infra' at many Big Tech companies. That being said, there are plenty of large tech companies whose data engineer role is more like a software role. It’s just not always apparent from the job description."
Remember that the 14KB rule is for the first 10 TCP packets combined.
This is a rule of thumb; some servers might send up to 30 packets.
True to what it preaches, this website does load very fast.
Prefectural University of Hiroshima had to shift to online classes at the beginning of the pandemic. They were evaluating various real-time video conferencing solutions for this purpose. The students stayed on campus and had poor Internet connectivity. The paper is a comparison of using Zoom, Microsoft Teams and Big Blue Button for this purpose.
Though the three solutions do not significantly differ in terms of features, BBB can work with a smaller bandwidth consumption. It is advised to not move the pointer on the slides since that would turn a mostly static image into an expensive video stream.
The findings are not surprising since BBB is a custom software built for this specific purpose, whereas Zoom and MS Teams are general purpose software for video conferencing and messaging. Teachers should be trained to use BBB effectively for its intended purpose.
Since nearly half of all carbon emissions in video conferencing happen on the network, self-hosting a video conferencing server like BBB as close to the users as possible can significantly reduce emissions.
This talk was recommended to me by the Irreal blog. This is my running commentary on the video.
Emacs use cases
- Using emacs as a PDF editor and viewer. Is it really as efficient as Evince or Okular though?
- Using emacs for mail. Seems legit. I haven't set up mine yet (mu4e).
- magit. duh! It's better than the Git CLI itself.
- Spreadsheets! I did use emacs for tables, but never really did calculations on them.
- Browsing the web. Yeah, I did use elinks for as long as my emacs mastodon client worked.
- Coding. Of course! Didn't get to write much Clojure yet, but I heavily use Emacs for Python.
Literate emacs configuration. Too much work!
Emacs info manuals. I haven't used them in years.
Emacs is from 1976! GNU Emacs is from 1985. The mother of all free software.
Emacs is a real-time display editor that is extensible.
`M-x describe-gnu-project` opens a webpage on the GNU website.
Playing as adults, within constraints. Playing an instrument. Playing is the original definition of hacking!
mu4e + org-capture is a great combo for creating task lists.
Narrowing the buffer is something I have to learn. Also, using org-mode hyperlinks more.
Once I had to do a lot of configuration modifications on a server. I installed emacs on it and also copied my configuration over, instead of just using TRAMP.
eshell is the only really portable built-in emacs shell, because the shell is implemented entirely in elisp.
Check out impatient-mode for viewing changes without browser reloads. I was using `entr` for this so far.
The vim people in the audience were probably not impressed by editor macros, but the people who use neither vim nor emacs must be.
I didn't use winner-mode for anything more than shifting between splits. But it can also store the configuration of window splits on a stack.
Undo-tree. Never used this, but I should. (It's a package from GNU ELPA, not built into Emacs.)
Emacs had hooks before they were cool. "Advising" a function allows you to add before and after hooks to a function.
Emacs is great for most dynamically-typed programming languages. Opening a REPL in another buffer and moving code snippets back and forth is quite fun.
Emacs is a Lisp REPL.
This is compressed wisdom. Read it once in a few months.
An interesting blog post about debugging an issue with log files, discovering the bug in the Linux kernel, discovering that it is a vulnerability and subsequently fixing it.
The description of their access logging system is quite interesting too.
CVE-2022-0847
A list of desired features for the next generation of Fediverse platforms, from a user's perspective. Non-technical.
Talks about some of the features from older platforms like diaspora*, Friendica and Hubzilla.
Related: Bonfire Networks
"While the Rabbits are clear that theirs isn’t a reproducible lifestyle for everyone, their offline computing microcosm is an extraordinary fusion of philosophy and practice that’s hard to find in a world where convenience is king."
"In an age where constant connection is a default way of life, the Rabbits’ offline-first philosophy is a curious but compelling form of technological radicalism that feels like a more thoughtful, evolved Luddism. Because the Uxn is designed to stave off bitrot — the digital decay of old information that we can no longer read — its significance to art preservation, creativity, and sustainability is more powerful than anything being peddled on a blockchain."
Summary (from the PDF)
"The human and economic cost of the COVID-19 pandemic has been staggering, in terms of lives lost, human suffering, and economic damage. As we enter the third year of the pandemic, we still find ourselves on a rollercoaster of lockdowns, variants, and broken promises. Inequality has actively prolonged the pandemic, devastating lives and livelihoods. Women have shouldered an especially heavy burden."
"While effective vaccines provide hope, their rollout has tipped, from a natural desire to protect citizens, into nationalism, greed, and self-interest. Large numbers of people in low-income countries face the virus unprotected and millions of people would still be alive today if they had had access to a vaccine. Big pharmaceutical corporations have been given free rein to prioritize profits ahead of vaccine equality. As we mark two years since the official pandemic was declared, we still have a chance to gain the upper hand on the virus, but only if everyone, everywhere has access to vaccines and treatments."
Decentralized food ecosystems built on an open platform.
Social tools built for and by communities.