Capture Injection (ongoing)

Building an AI-Driven Workflow Automation System
Project Log (12/08/2024):

I don't really have a name for this project yet, but I wanted to document everything as I go (as opposed to some of my other projects). I know this project will take a significant portion of time and require some skills. It will also be my first AI-ish project.

The idea comes from me not wanting to do my homework, classwork, tests, etc. I don't think it's really a solid ROI, but for me, not wanting to jump through hoops to get a piece of paper, it's worth it. I also enjoy coming up with creative solutions that test my skills. Maybe school is too easy and I get bored. Some people have told me that before, and I used to read books on robotics in chemistry class (I failed that twice), so... Anyways, I'll just end up spending my extra time studying law and prepping for the LSAT.

So what's the plan? This is best explained in two parts...

Part one is to capture the output from my Mac Studio and do some filtering, because HDCP is annoying. In Fig. 1, you can see this as the Vertex Stripper and the Elgato capture card. My work is to write some software that will run on my Mac Mini to process what's on the screen (input from the capture card). This shouldn't be too hard, but I haven't really solved this issue yet. I did, however, do some testing and it seems doable. Once the Vertex shows up in the mail I'll have a more definitive answer, because things don't always work like they say they will, but hopefully a $300 device does what it says it does.

Part two will involve the Arduino HID pass-through/injection. I will need to allow my Apple Magic Keyboard and Apple Magic Mouse to operate normally until specific key combinations are pressed. If those special key combinations are pressed, then the Arduino will start listening for specific inputs that tell the Mac Mini to do something. I did some testing on an Arduino MKR 1010, but it seems that there's another chip that interfaces with/as the HID device, so the functionality is limited. It seems the Arduino Nano ESP32 solves this issue. We will see...
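The pass-through idea boils down to a two-state machine, so here is a rough sketch of it in plain C++ (not Arduino code yet). Everything in it is an assumption for illustration: the names `Mode`, `KeyEvent` and `ModeSwitch` are made up, and the actual trigger combination and exit key haven't been chosen.

```cpp
// Hypothetical sketch of the pass-through / command-mode switch.
// None of these names come from real project code.
enum class Mode { PassThrough, Command };

struct KeyEvent {
    bool triggerCombo;  // true while the (not yet chosen) special key combination is held
    bool exitKey;       // true when the (not yet chosen) exit key is pressed
};

class ModeSwitch {
public:
    // Returns true if the event should be forwarded to the Mac Studio unchanged,
    // false if the Arduino keeps it for itself.
    bool forward(const KeyEvent &ev) {
        if (mode_ == Mode::PassThrough) {
            if (ev.triggerCombo) {
                mode_ = Mode::Command;  // start listening for commands
                return false;
            }
            return true;                // normal typing goes straight through
        }
        // Command mode: nothing reaches the Mac Studio.
        if (ev.exitKey) mode_ = Mode::PassThrough;
        return false;
    }

    Mode mode() const { return mode_; }

private:
    Mode mode_ = Mode::PassThrough;
};
```

One thing this sketch ignores: if the Mac Studio already saw part of the trigger combination go down before the switch, a real version would probably need to send it an "all keys up" so nothing looks stuck.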

I want to give an example, so let's assume I want it to answer a question on my screen (Mac Studio). I'd press some key combination, and the Arduino would then send a signal (via Wi-Fi) to the Mac Mini. The Mac Mini would execute a script that does some image processing, uses a large language model (LLM) to find an answer, or makes an API request to get the answer. Then it can spit out the answer or replace my keystrokes with the correct answer. I'm thinking of having it work somewhat like a textbox terminal (e.g., type a command, it erases my command and responds, then erases its response). IDK, we will see how it goes.

Lots of work ahead, but I eliminated some of the more significant edge cases that I could. First, I'll need to finalize the HDCP bypass and video capture setup, ensuring the Mac Mini can reliably process screen data from the Mac Studio. This involves writing software that can handle the captured input and potentially preprocess it for further tasks like image recognition. Next, I need to build out the Arduino-based HID system, including creating the logic for detecting and interpreting special key combinations. I'll also need to establish robust Wi-Fi communication between the Arduino and the Mac Mini, ensuring it can trigger scripts and handle responses in real time. On the software side, developing scripts for tasks like text recognition, API interactions, or LLM processing will require exploration and experimentation. It's a lot of moving pieces, but with proper planning and iterative development, I’m confident I can bring it all together. The key is to start small, solve one problem at a time, and document everything thoroughly as I progress.

Figure 1

Overcoming Edge Cases in Modern Keyboard Integration
Project Log (12/28/2024):

Starting with part one from earlier, you can see some heavy-duty parts got removed.

  • Removed Vertex Module
  • Removed Elgato Capture Card
  • Added Random Amazon Capture Card

I was flashing the Vertex and it got bricked. It worked initially but had issues bypassing HDCP 2.2. I think this was the fault of the Elgato capture card, not the Vertex. Also, the Elgato capture card doesn't really work as well as it claims, or at least it's misleading in some aspects. In a last-ditch attempt I got a $35 capture card off of Amazon because there were reviews on the internet that said this would work to bypass HDCP 2.2. Sure enough it fucking did...🙄 All that money and time spent just to find out a POS from China did what I needed. I don't really need the video feed to be 4K, so this setup with 1080p 30fps works. Honestly, I probably would have had to downscale the feed anyways, because trying to manipulate a 4K 60fps video feed would kill my Mac Mini. It would have been nice to have the higher resolution to make OCR easier. Famous last words, right?

Now on to part two: I ended up making a few changes to that setup. You can see all these changes reflected in Figure 2 below.

  • Only using Keyboard (Mouse will be connected directly to Mac Studio)
  • Switched to Arduino® UNO R4 WiFi
  • Added SparkFun USB-C Host Shield

I decided the Magic Mouse doesn't really need to be used for anything special, and if the Arduino® UNO R4 WiFi fails I will still be able to use my cursor. I could potentially use it to send commands and stuff, but realistically I don't need to complicate things.

The order from Arduino got significantly delayed, so I went to Jameco across the Bay to pick up two Arduino® UNO R4 WiFi boards to see if I could make that work. Turns out the new boards have a different architecture (Renesas RA4M1, Arm® Cortex®-M4, with a 48 MHz clock speed, 32 kB SRAM and 256 kB flash memory) and most libraries haven't been updated to support the new chip. I was able to get the Keyboard library from Arduino to type some things, but getting it to read from the keyboard was challenging. This has to do with the USB HID handshake/communication protocol. I believe I could have done it, but I decided to make my life easier and opt for the SparkFun USB-C Host Shield. I played around with the examples from the USB Host Shield 2.0 Library. That worked great! It took a while to understand the library, but I wrote this bit of code that allows every button on the Apple Magic Keyboard to be read.

// rawHexTesting.ino
#include <hidboot.h>
#include <usbhub.h>
#include <SPI.h>

// Custom parser class with a unique name
class CustomKeyboardParser : public KeyboardReportParser {
protected:
    // Override the Parse method to print raw data
    void Parse(USBHID *hid, bool is_rpt_id, uint8_t len, uint8_t *buf) override {
        // Print raw data
        for (uint8_t i = 0; i < len; i++) {
            if (buf[i] < 0x10) Serial.print("0"); // Add leading zero for single digit hex values
            Serial.print(buf[i], HEX);
            Serial.print(" ");
        }
        Serial.println();

        // Call the base class implementation for further parsing
        KeyboardReportParser::Parse(hid, is_rpt_id, len, buf);
    }
};

// USB and HID setup
USB Usb;
HIDBoot<USB_HID_PROTOCOL_KEYBOARD> HidKeyboard(&Usb);
CustomKeyboardParser Parser;

void setup() {
    Serial.begin(921600); // Set baud rate to 921600 so I can see keys realtime in serial monitor
    while (!Serial); // Wait for serial connection

    if (Usb.Init() == -1) {
        Serial.println("USB initialization failed");
        while (1); // Stop here if USB initialization fails
    }
    Serial.println("USB initialized");

    HidKeyboard.SetReportParser(0, &Parser); // Set the custom parser
}

void loop() {
    Usb.Task(); // USB task
}

The serial monitor looks like this when pressing keys...

01 00 00 04 00 00 00 00 00 00 - "a" KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 00 00 05 00 00 00 00 00 00 - "b" KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 00 00 06 00 00 00 00 00 00 - "c" KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 01 00 00 00 00 00 00 00 00 - LEFT CONTROL KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 04 00 00 00 00 00 00 00 00 - LEFT OPTION KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 08 00 00 00 00 00 00 00 00 - LEFT COMMAND KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 00 00 00 00 00 00 00 00 04 - Apple Finger Print Reader KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP
01 00 00 6E 00 00 00 00 00 00 - SPACEBAR KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP

You would think this would all be easy, right? But there are so many edge cases associated with modern keyboards.

Now here's the fun part and one of the edge cases I had to deal with... modifier keys! For example, let's say you need to select all, e.g. "Command + a". How does the computer know you're pressing both at the same time? Does it just send "Command" and then "a"? Well, sort of. It ends up looking like this...

01 08 00 00 00 00 00 00 00 00 - LEFT COMMAND KEYDOWN
01 08 00 04 00 00 00 00 00 00 - LEFT COMMAND KEYDOWN + "a" KEYDOWN
01 08 00 00 00 00 00 00 00 00 - LEFT COMMAND KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP

Neat! Every key-up or key-down event sends a report showing all the keys that are currently pressed! But what if we need to press even more keys?! Let's see what pressing "a + b + c + CONTROL + OPTION + COMMAND" looks like...

01 00 00 04 00 00 00 00 00 00 - "a" KEYDOWN
01 00 00 04 05 00 00 00 00 00 - "a" + "b" KEYDOWN 
01 00 00 04 05 06 00 00 00 00 - "a" + "b" + "c" KEYDOWN 
01 01 00 04 05 06 00 00 00 00 - LEFT CONTROL KEYDOWN + "a" + "b" + "c" KEYDOWN  
01 05 00 04 05 06 00 00 00 00 - LEFT CONTROL KEYDOWN + LEFT OPTION + "a" + "b" + "c" KEYDOWN  
01 0D 00 04 05 06 00 00 00 00 - LEFT CONTROL KEYDOWN + LEFT OPTION + LEFT COMMAND + "a" + "b" + "c" KEYDOWN 
01 05 00 04 05 06 00 00 00 00 - LEFT CONTROL KEYDOWN + LEFT OPTION + "a" + "b" + "c" KEYDOWN   
01 01 00 04 05 06 00 00 00 00 - LEFT CONTROL KEYDOWN + "a" + "b" + "c" KEYDOWN  
01 00 00 04 05 06 00 00 00 00 - "a" + "b" + "c" KEYDOWN 
01 00 00 04 05 00 00 00 00 00 - "a" + "b" KEYDOWN
01 00 00 04 00 00 00 00 00 00 - "a" KEYDOWN
01 00 00 00 00 00 00 00 00 00 - ALL KEYS UP 
											
											

This is more complicated, so let's break it down. The letters are all there (a=04, b=05 and c=06), and as they get pressed they get tacked on to the end. This is limited in length to 6 slots. Realistically, you can have N slots for HID devices, but from what I understand this library limits us to the six-key layout of the USB HID boot keyboard report.
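To make the slot idea concrete, here is a small sketch in plain C++ that pulls the held keys out of one of these 10-byte reports. The byte layout in the comments is only what the logs above suggest, not something taken from Apple documentation or the library source, and the function name `pressedKeys` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Layout inferred from the serial logs (an assumption):
//   byte 0      report ID (always 0x01 in the logs)
//   byte 1      modifier byte
//   byte 2      always 0x00 in the logs (treated as reserved)
//   bytes 3..8  six key slots, 0x00 meaning "empty"
//   byte 9      extra byte (0x04 showed up for the fingerprint key)
constexpr std::size_t kReportLength = 10;
constexpr std::size_t kFirstKeySlot = 3;
constexpr std::size_t kKeySlotCount = 6;

// Returns the codes of all keys currently held, in slot order.
std::vector<std::uint8_t> pressedKeys(const std::uint8_t *report, std::size_t len) {
    assert(len >= kReportLength);  // refuse short reports instead of reading past the end
    std::vector<std::uint8_t> keys;
    for (std::size_t i = 0; i < kKeySlotCount; ++i) {
        std::uint8_t code = report[kFirstKeySlot + i];
        if (code != 0x00) keys.push_back(code);
    }
    return keys;
}
```

Feeding it the `01 00 00 04 05 06 00 00 00 00` report gives back 04, 05 and 06, and the "ALL KEYS UP" report gives back an empty list.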

Now, the modifier keys are used in a really clever way. Instead of having a buffer-ish implementation, they actually get added together. Each modifier is its own bit in a single byte, so left control was 01 and left option was 04 from before, but when we press them both down at the same time it becomes 05. In hex, 01 + 04 + 08 gives 0D. One hex code to tell us that those left three modifier keys are being pressed! Pretty smart if you ask me!
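Going the other way (from one hex code back to the list of modifier keys) is just a matter of testing each bit. A minimal sketch in plain C++: the bit order is the standard USB HID keyboard modifier byte, the OPTION/COMMAND names are used to match the logs above, and `decodeModifiers` is a made-up name for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Bit 0 is left control and bit 7 is right command (GUI), following the
// standard USB HID keyboard modifier byte.
static const char *const kModifierNames[8] = {
    "LEFT CONTROL",  "LEFT SHIFT",  "LEFT OPTION",  "LEFT COMMAND",
    "RIGHT CONTROL", "RIGHT SHIFT", "RIGHT OPTION", "RIGHT COMMAND",
};

// Returns the name of every modifier whose bit is set in the modifier byte.
std::vector<std::string> decodeModifiers(std::uint8_t modifiers) {
    std::vector<std::string> held;
    for (int bit = 0; bit < 8; ++bit) {
        if (modifiers & (1u << bit)) held.push_back(kModifierNames[bit]);
    }
    return held;
}
```

With this, `decodeModifiers(0x0D)` returns LEFT CONTROL, LEFT OPTION and LEFT COMMAND, which matches the 0D line in the log.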

There are definitely a lot more edge cases, but for what I need to do, having access to those modifier keys should be more than enough.
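Since every report carries the full set of held keys, individual key-down and key-up events have to be recovered by comparing one report with the next. Here is a hedged sketch of that comparison in plain C++. It assumes the key slots sit in bytes 3 to 8 (as the logs suggest), and `KeyChange` and `diffReports` are names invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct KeyChange {
    std::uint8_t code;  // code taken from a key slot
    bool down;          // true = newly pressed, false = newly released
};

// Key slots are assumed to be bytes 3..8, as inferred from the logs.
constexpr std::size_t kFirstKeySlot = 3;
constexpr std::size_t kKeySlotCount = 6;

// True if the report holds the given code in any of its key slots.
static bool holds(const std::uint8_t *report, std::uint8_t code) {
    const std::uint8_t *first = report + kFirstKeySlot;
    const std::uint8_t *last = first + kKeySlotCount;
    return std::find(first, last, code) != last;
}

// Compares two consecutive reports (each at least 9 bytes long) and lists
// which keys appeared and which disappeared.
std::vector<KeyChange> diffReports(const std::uint8_t *prev, const std::uint8_t *curr) {
    std::vector<KeyChange> changes;
    for (std::size_t i = 0; i < kKeySlotCount; ++i) {
        std::uint8_t was = prev[kFirstKeySlot + i];
        std::uint8_t now = curr[kFirstKeySlot + i];
        if (now != 0x00 && !holds(prev, now)) changes.push_back({now, true});
        if (was != 0x00 && !holds(curr, was)) changes.push_back({was, false});
    }
    return changes;
}
```

Modifier changes are left out here. They could be found the same way by XOR-ing the two modifier bytes and checking which bits flipped.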

Moving forward I will need to...

  • Part One - Capture Card Processing
    • Play around with OCR to see what works best
    • Lots of brainstorming...
  • Part Two - Arduino
    • Map Apple Magic Keyboard keys to their Hex codes
    • Arduino code for pass through
    • Arduino code for "Command Mode"
    • Arduino code for Wi-Fi injection from Mac Mini
Figure 2

Future-Proofing, the Future of CUAs, and Memory Limitations
Project Log (03/23/2025):

It's been a few months since the last post, and the Capture Injection / Pass-through project has made significant progress. At its core, the system still relies on three main components: the Mac Studio (display output), the Mac Mini (API server), and the Arduino (command injection terminal).

Two GitHub repositories now support the project:

- AI-Observer – API server running on the Mac Mini
- ArduinoKeyBridge – Arduino firmware handling keyboard emulation and communication

An additional project is planned, likely called Arduino Mouse Bridge, and may later merge into a larger unified effort under the name ArduinoCUAB (Computer-Using Agent Bridge). This reflects the growing shift toward agent-based interfaces developed by companies like OpenAI and Anthropic, both of which are building agents that can control and use computers autonomously.

Due to rapid developments in multimodal AI agents, I’ve officially deprecated earlier efforts in optical character recognition (OCR) and visual parsing, including the use of Google Tesseract. I assume my reliance on GPT-4o image recognition will eventually be deprecated as well.

What continues to justify this hardware-based approach is its flexibility and isolation. By assigning the Mac Mini as a dedicated AI/computer-using agent machine, I can interact with multiple agents simultaneously, instead of committing system-wide resources to one. This sidesteps software-level constraints and future-proofs the setup better than most purpose-built AI hardware, which tends to become obsolete fast.

The project now supports physical key-based triggers via function keys F13–F18. Each key maps to different tasks like:

- Capturing a screenshot of user's computer through the capture card
- Sending context to OpenAI
- Receiving and "typing" AI-generated output character by character
- Dumping full responses on command
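As a rough illustration of the key-to-task mapping, here is a lookup sketch in plain C++. The usage IDs for F13 to F18 (0x68 to 0x6D) come from the standard USB HID keyboard usage table, but the assignment of tasks to keys is purely an assumption, since this log doesn't say which key does what. `Task`, `taskForKey` and `isTriggerKey` are made-up names.

```cpp
#include <cstdint>

// Hypothetical task names based on the list above.
enum class Task { None, CaptureScreenshot, SendContext, TypeResponse, DumpResponse };

// USB HID keyboard usage IDs: F13 = 0x68 through F18 = 0x6D.
constexpr std::uint8_t kF13 = 0x68;
constexpr std::uint8_t kF18 = 0x6D;

// True for any of the six function keys reserved as triggers.
bool isTriggerKey(std::uint8_t usage) {
    return usage >= kF13 && usage <= kF18;
}

// The key-to-task assignment below is invented for illustration only.
Task taskForKey(std::uint8_t usage) {
    switch (usage) {
        case 0x68: return Task::CaptureScreenshot;  // F13 (assumed)
        case 0x69: return Task::SendContext;        // F14 (assumed)
        case 0x6A: return Task::TypeResponse;       // F15 (assumed)
        case 0x6B: return Task::DumpResponse;       // F16 (assumed)
        default:   return Task::None;               // F17, F18 and everything else
    }
}
```

A table like this keeps the trigger keys in one place, so reassigning a task is a one-line change.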

This interaction is mediated by the Mac Mini’s API server, which handles all the heavy lifting. Keystroke injection enables fake typing modes for presentations or live demos, enhancing the illusion of real-time input. Memory is currently a limiting factor, and I'm optimizing the Arduino code by testing variable scope, static/global memory usage, and program size. This takes a good amount of effort.

While I’ve explored the idea of voice interaction, it’s not useful for me at this time. However, I’ve implemented NeoPixel RGB LEDs for status display. This is especially important since the Mac Mini doesn’t always provide terminal visibility during headless operation. These visual indicators improve usability without requiring a monitor or serial monitor access.

This modular system pairs well with AI technologies and has proven far more stable than earlier experiments, such as my Google Chrome plugin and macOS app—both of which became obsolete quickly. That experience validated the pivot toward dedicated hardware.

One unresolved challenge is HDCP-compliant video capture. Many capture cards are blocked by HDCP protection. While some non-compliant cards exist (often outside U.S. regulations), they’re hard to acquire. My current low-res Amazon card technically bypasses HDCP and works for now, but I’m actively researching better solutions.

This project will continue as long as it adds value in short timeframes (e.g., three months out). Its direction may shift depending on emerging APIs or tools from OpenAI, Anthropic, or others—but for now, it provides unique utility and serves as a compelling hardware-level implementation of AI-assisted computing.

Below is a new design I'm considering, because CUAs are starting to become more available and it makes sense to adopt them for this project. I can't think of any future developments from AI companies that would make my hardware solution obsolete. I assume they will essentially scale and try to bring mass adoption of these technologies to users. It would make sense for these companies to venture into robotics and automation, as they are starting to with Figure robotics. Building LLMs/agents for robotics would allow me to eventually double down on a hardware CUA.

Thoughts: Luckily I am able to keep up with these technologies from a software perspective. I'm being forced to change because of the same technology that is making my project more powerful. Weird. It would be nice to eventually get this on a PCB and have a CUA that can constantly operate. What a strange world that would be. It would force users to employ a higher level of thinking. You would need to not only be able to fix intricate details of various things but also have the forethought to consider convoluted processes.

Figure 3: Capture Injection/Pass-through Flow Diagram. Labels recovered from the diagram: Mac Studio, Neo G9 Monitor, Capture Card (1080p 30 fps), Mac Mini, Keyboard, Mouse, Arduino® UNO R4 WiFi & SparkFun USB-C Host Shield, Arduino® Nano ESP32. Connection labels: HDMI, USB-C, USB-C to USB-C, USB-A to USB-C, Lightning to USB-C, Wi-Fi.

General Disclaimer: The information, resources, and materials provided on this blog are intended for educational and informational purposes only. While we strive to ensure the accuracy, reliability, and relevance of the content presented, we make no guarantees or warranties, express or implied, about the completeness, accuracy, or suitability of the information for any particular academic, professional, or personal purpose.

Non-Professional Advice: The content shared on this blog does not constitute professional, academic, legal, medical, or financial advice. Users should seek the advice of qualified professionals for matters requiring specialized expertise.

Academic Guidance: While this blog provides insights, resources, and tips related to academic topics, it should not be used as a substitute for institutional guidelines, academic advisors, or educational resources provided by accredited institutions. Always consult your school, college, or university policies and faculty for academic compliance requirements.

Dynamic Nature of Information: Academic standards, guidelines, and best practices evolve over time. While we endeavor to keep our content up-to-date, we cannot guarantee that all information will reflect the most current developments in any academic field or discipline.

Plagiarism and Academic Honesty: We strongly discourage any misuse of the content provided on this blog. All users are responsible for adhering to their institution’s policies on plagiarism and academic honesty. Copying, paraphrasing, or otherwise utilizing this blog’s content in a manner inconsistent with academic integrity standards is strictly prohibited.

Attribution and Referencing: Any references, citations, or external resources provided within this blog are offered to support further study and exploration. Users are responsible for ensuring proper citation and adherence to citation styles required by their academic institutions.

Third-Party Content: This blog may link to or reference third-party websites, articles, or other resources. We do not control, endorse, or guarantee the accuracy or appropriateness of any third-party content. Users should evaluate third-party materials critically and independently.

Compliance with Institutional Rules: This blog does not represent any specific academic institution, program, or organization. Users are individually responsible for complying with the rules, policies, and ethical guidelines of their respective institutions.

Course and Assignment-Specific Guidance: Any advice or insights offered herein are generalized and may not apply to specific assignments, courses, or academic programs. Always defer to instructions and criteria provided by your course instructors or academic advisors.

No Guarantees of Academic Success: The use of this blog and its materials does not guarantee improved academic performance, grades, or outcomes. Success in academic endeavors depends on individual effort, adherence to institutional guidelines, and other external factors beyond our control.

No Responsibility for Misuse: We disclaim any liability for actions taken by users based on the information provided on this blog. Misinterpretation, misuse, or misapplication of the content is the sole responsibility of the user.

Accuracy and Corrections: If you find any inaccuracies, outdated information, or unclear guidance on this blog, please contact us so that we may address these concerns. While we strive for precision, errors or omissions may occasionally occur.

Open to Improvement: This blog values constructive feedback and is committed to fostering a culture of academic growth and intellectual honesty. Suggestions for improving the content or addressing compliance concerns are welcome.

Illustrative and Educational Intent: Any examples, scenarios, or materials provided on this blog that might appear to conflict with academic integrity or compliance standards are presented solely for illustrative or educational purposes. These examples are not intended to encourage, endorse, or condone unethical or non-compliant behavior. Users are expected to interpret and apply the information responsibly and in accordance with the academic and ethical guidelines of their respective institutions.