A free online book on using hypermedia controls effectively.
It introduces the library htmx, which lets you achieve the interactivity of a single-page application (SPA) in traditional web applications that render templates on the server side.
It also introduces an approach to making mobile apps using a library called HyperView.
To get a feel for what is possible with this library, see this user report in which a company switched from a traditional frontend/backend separation using React to doing full-stack development with htmx.
A collection of resources to teach yourself Computer Science.
The recommended resources are better than what you'd find in most university courses. I believe this is a good investment of time for self-taught computer programmers (@ 100 to 200 hours per subject).
Guides on various computer science concepts. I am especially interested in the guide on UNIX inter-process communication.
How to set up Colemak as the keyboard layout for the login screen in the GNOME desktop on Debian 12.2.
Ktor seems to be the JVM ecosystem's answer to NodeJS.
This seems like a good enough naming convention to me, though I'd be fine using either snake_case or lisp-case for the topic names.
Can be considered a lightweight alternative to TT-RSS. There's also a mobile app called Miniflutt.
Paul Lafargue, the son-in-law of Karl Marx, starts by writing about the Christian dogma of work that served capitalist interests, turning rural peasants of leisure into factory workers who slaved away at machines for 12 hours a day and spent 4 hours walking between work and home.
Meanwhile, the capitalists of the 19th century were forced to over-consume themselves into ill health and to be constantly in search of new markets for their overflowing and unnecessary goods. There were as many domestic workers serving the capitalist class as there were industrial workers in England.
He then sees a transformation of society where the proletariat recognizes their rights, punishes those who pushed harmful ideologies on them and restricts work to 3 hours a day.
The author also mentions ancient Greek civilization, where, despite having slaves, the citizens and soldiers of democracies didn't engage in shop work or mercantile trade. They despised wage slavery most of all; anyone engaging in it could be punished with imprisonment.
The conditions described by Lafargue in France closely resemble the sweatshop work done by people in poor third-world countries in service of multinational corporations (refer to No Logo by Naomi Klein). Wage slavery still exists. We no longer seem to have citizens like those of ancient Greece who refused such shop work.
It's a sign of shame on modern human civilization that many of the evils Lafargue described in early-stage industrial capitalism are prevalent to this day.
Bluesky's faux-open AT Protocol is kinda like ActivityPub, but people host their own posts, which makes content creators pay for their own content, while distribution of that content is done by huge centralized entities, probably only Bluesky.
Maybe people will make great third-party apps for Bluesky, saving Bluesky PBLLC some more money. The implementation of the content servers can be open source too.
Don't pay for the code. Don't pay for the content, but inject ads indistinguishable from the content into the feed, because you control the means of distribution. This is a plan that would sound brilliant to any venture capitalist.
I honestly didn't expect this to be anything but a scam since the day Jack Dorsey announced this as a reaction to ActivityPub. Embrace, Extend and Extinguish in action.
A brilliant article from Naomi Klein on Silicon Valley's hallucinations about Generative AI.
I have created/used Service Templates myself. "Stamping out a new service" seems apt as an analogy.
I have difficulty with "chassis" as an analogy in this context, but it's not really scaffolding either. It's common for multiple cars from a company to share a common platform (an analogy used in platform engineering), but it's not common for them to share the same chassis.
An important read on forming organizations that resist the established social structures. Just as relevant to anarchist organizations as to feminist ones.
This 3-part series on collecting data, alerting and investigating performance issues is an essential read for anyone maintaining large scale systems in production.
A blog post on treating shell scripts like they're real code.
Some of these patterns could also be useful in other applications.
It is inspiring to see that it is possible to restore land masses degraded over millennia. However, I felt that this documentary oversells restoration as a solution to the climate crisis. Maybe 2009 was a more optimistic time.
This manifesto has my full endorsement.
This is a 6-hour course on Spark, available as one YouTube video. Many things in this video are still relevant in 2022.
- Spark and Mesos were started at UC Berkeley's AMPLab
- Spark was originally a sample app for Mesos, to "spark" an ecosystem of apps for Mesos.
- Spark is an engine for scheduling, monitoring and distributing Big Data
- Hive queries work out of the box in Spark SQL because it uses the same frontend and Metastore
- Spark Streaming has pretty much replaced Apache Storm. Spark in general has replaced most of the Hadoop ecosystem, which consisted of too many tools.
- All higher level things push the compute down to Spark Core
- Streaming, SQL, MLlib, GraphX, BlinkDB, Tachyon
- BlinkDB can do time-limited queries over large amounts of data. Also returns confidence level in the result.
- Tachyon is a memory-based distributed storage system
- Spark's Resource Managers
- YARN
- local-mode
- Mesos (stands for mediator in Greek)
- standalone (datastax Cassandra)
- Resource managers can be made highly-available by storing their configuration in Zookeeper
- Storage: almost any filesystem or NoSQL database; anything that has a Hadoop InputFormat.
- Fun fact: Tableau also uses Spark as its engine
- Hadoop MapReduce constantly needs to write to HDFS. Spark does it all in memory. Hence, it is usually at least 10x faster.
- Recommended reading
- Original Spark white paper and another white paper on RDDs
- "Learning Spark" from O'Reilly is the best book
- Spark packages can be found at https://spark-packages.org
- Integration with other tools, libraries for some tasks.
- Spark Shell always runs on the Driver
- In RDDs, the more partitions you have the more parallelism you have. Each partition requires a task, essentially a JVM thread.
- RDDs might even store images and videos. The `parallelize` method on the SparkContext object can be used to create an RDD. Reading from a file can also create RDDs, but the partitions it creates might not be optimal; use `repartition` to fix that.
- A transformation like `filter` could produce empty partitions. `coalesce` can be used to reduce the number of partitions in an RDD.
- A `.collect` call pushes the data back to the Driver JVM from the workers. Ensure that the Driver has enough memory to do this. It's better to just push the data back to HDFS or Cassandra, or to collect only a sample of the larger RDD.
- All of the transformations in Spark are lazy. Spark builds a DAG of transformations but doesn't actually execute them. Only when an action is called does the DAG get executed, aka materialized.
- Caching RDDs
- An RDD has to be materialized before it can be cached.
- If not cached, all the RDDs involved in a transformation will be removed. So, what to cache? Whatever RDD is used multiple times. Don't cache the base RDD, maybe cache the cleaned RDD.
- If an RDD doesn't fully fit into memory, Spark will store the rest on disk. Also applies to cached RDDs. Accessing disk would slow down Spark.
- Resource Managers: Standalone, YARN, and Mesos have dynamic partitioning (the number of executors); local mode has only static partitioning.
- Parallelism: Hadoop MapReduce used JVM processes for parallelism, whereas Spark uses JVM threads in its Executor JVMs. Hadoop MR had dedicated slots for Map and Reduce operations, which was inefficient in resource utilization.
- Over-subscription: Set the number of tasks to be like 2-3x the number of cores on your machine. Setting them equal to the number of cores isn't necessary. Let the Executor JVM handle its threads.
- `spark-submit` is a command to submit your own Scala programs to the Spark Driver, which submits the work to the Spark Master.
- Standalone mode (with Datastax Cassandra)
  - Master and Worker use little resources compared to the Executors. Masters can be HA and can even be added to a live cluster.
  - 1 Worker can only run 1 Executor for each application. To have more than one Executor on a machine for a single application, a corresponding number of Workers should be started as well.
  - Mounted disks are listed in the `SPARK_LOCAL_DIRS` environment variable. RDDs can be stored (fully or partially) on these disks.
  - It's possible to have a heterogeneous composition of machines. Spark learns the number of cores on each machine from the `SPARK_WORKER_CORES` variable in spark-env.sh. `SPARK_WORKER_MEMORY` is the amount of memory that a Spark Worker can give to its Executors. Similarly, a Master's memory is the total amount of memory that it can allocate to its Workers.
- YARN: Yet Another Resource Negotiator
  - Spark can run in either client-mode or cluster-mode on YARN.
    - In client-mode, the Driver runs on a client machine.
    - In cluster-mode, the application and the Driver are submitted to the cluster at the same time, with no dependency on the client machine.
  - On YARN, Spark can also dynamically increase and decrease the number of executors.
- Persistence
  - Persisting in memory with serialization can be more space-efficient, but costs CPU. Use a very efficient serialization format.
  - Caching of RDDs works like an LRU cache. If there isn't enough memory, the oldest RDDs will be evicted first.
  - In memory + disk storage, the oldest partitions of an RDD will be moved to disk in a serialized format. By default, the in-memory parts will not be serialized.
  - `MEMORY_ONLY_2` can store the RDD on two different JVMs. Only to be used for extremely expensive RDDs.
  - Tachyon (since renamed to Alluxio) can be used for `OFF_HEAP` storage. It can be used for sharing RDDs between different Spark applications written in multiple languages.
  - Shuffle operations automatically persist all the RDDs.
  - Local dirs will be cleaned up in an LRU fashion.
  - PySpark objects are always serialized using the Pickle library. Python object representations are not used even when persisted in memory.
- Lineage of RDDs
  - RDDs can be produced by narrow transformations or wide transformations.
    - In narrow transformations, one partition in the parent is used by only one partition in the child. Could be one-to-one or many-to-one. Can be executed in parallel.
    - In wide transformations, one partition in the child could depend on multiple partitions in the parent. Needs shuffling. Costly to recreate.
  - Pipelining is an internal optimization in Spark that lets one thread do multiple transformations on a partition.
  - `toDebugString` prints out the lineage of an RDD, with stage boundaries represented using indentation.
- Shuffling
  - repartition, join, cogroup, and the *By and *ByKey operations all cause shuffles.
  - The numPartitions parameter might cause a shuffle.
  - coalesce never causes a shuffle.
  - Broadcast variables can avoid shuffles. A broadcast variable is a small amount of data, usually a read-only lookup table, that is broadcast from the Driver to all the Executors. Broadcast uses the BitTorrent protocol.
  - Accumulators are like counters in Hadoop. They count events that occur during job execution. Good for debugging. They only support associative operations. Only the Driver program can read an accumulator's value; tasks don't know this value.
- Layering: (PySpark (Java API (Spark Core Engine))). Can use CPython or PyPy, for both Driver and Worker machines. PyPy is better when not using C libraries.
- Netty native buffers bypass the kernel buffers and JVM buffers to get data directly from disk to the network, avoiding two copies.
- Shifted from hash-based shuffle to sort-based shuffle.
- Spark Streaming can read data at high rates from various sources, do transformations and then write to various storage options.
- Processing latency must be lower than the batch interval; otherwise the stream will fall behind.
- Spark Streaming uses micro-batches, whereas Storm processes each event as it arrives. Storm Trident is like Spark Streaming. Storm uses at-least-once processing, so an event could be processed more than once. The streams of micro-batches in Spark Streaming are called DStreams.
- Batch interval is user configurable.
- Can do sliding window operations on DStreams.
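Not Spark code, but the window/slide mechanics of a DStream can be illustrated in plain Python (the function and parameter names are just for illustration):

```python
# One inner list per batch interval, as a stand-in for DStream micro-batches.
batches = [[1], [2, 3], [4], [5, 6], [7]]

def sliding_window_counts(batches, window_len, slide):
    """Count events over the last `window_len` batches, every `slide` batches."""
    counts = []
    for end in range(slide, len(batches) + 1, slide):
        window = batches[max(0, end - window_len):end]
        counts.append(sum(len(batch) for batch in window))
    return counts

# A window of 3 batch intervals, sliding every 2 intervals.
window_counts = sliding_window_counts(batches, window_len=3, slide=2)
```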
I have been using Nushell as my primary shell for about 6 months.
I don't miss Bash and Zsh yet, though I have to switch to bash once in a while to paste commands I find online.
Defining custom functions in Nushell is a pleasure since it uses a Ruby-like language.
Nushell also has several built-in commands that obviate the need for the weirder Unix utilities. I especially don't miss GNU Parallel, since Nushell's language supports easy parallelization.
You might appreciate Nushell if you are a data engineer/scientist, since it provides dataframes backed by Polars.
Yes, in a way, Nushell is a data pipeline tool pretending to be a shell.
I have no intention of ever learning PowerShell, but might end up working on a Windows machine at some point in my career. Nushell has me covered here. It works on Windows too!