Guide for Causal Map Workflows
Caching Cheat Sheet 📝
What is Caching?
Caching saves results of operations so they can be reused later without recalculating everything. Think of it like saving a draft of your work!
Types of Caches in the App
🏃♂️ Local Lines Cache
- Temporarily stores results of specific commands during your session
- Makes your workflow faster by not repeating calculations
- Clears when you close the app or click "Clear Cache" (first yellow broom)
📝 Local Cache: Named Results
- Used when you use the store xyz and recall xyz commands
- Saves specific results you want to reuse further down in your workflow with unstore or recall
- Only available during your current session
💾 Remote Lines Cache
- Saves results across sessions
- Whenever we save to the local lines cache, we also save to the remote lines cache
- When you come back tomorrow, previously calculated results are still available
- Helps avoid recalculating common operations
- Remote cache is fast, but local cache is faster (which is why we have it)
🤖 Prompt Cache
- Saves responses from AI prompts
- Helps avoid making the same AI calls multiple times
- Especially useful for expensive or time-consuming AI operations
- Everything has to be the same (the preceding table and the prompt), but statement numbers can differ
- If you run a prompt for table T for statements 1-10, then again for statements 1-20, the cache for the first 10 lines will be re-used
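To make the per-statement reuse concrete, here is a minimal sketch, not the app's actual code, of a prompt cache keyed on the preceding table, the prompt and each individual statement (the names prompt_cache_key, run_prompt and call_llm are illustrative):

```python
import hashlib

def prompt_cache_key(table_fingerprint: str, prompt: str, statement: str) -> str:
    """Same preceding table + same prompt + same statement text => same key,
    regardless of which batch of statement numbers the statement arrived in."""
    raw = "\x1f".join([table_fingerprint, prompt, statement])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

cache: dict[str, str] = {}

def run_prompt(table_fingerprint, prompt, statements, call_llm):
    out = []
    for s in statements:
        key = prompt_cache_key(table_fingerprint, prompt, s)
        if key not in cache:          # statements cached in an earlier run are reused here
            cache[key] = call_llm(prompt, s)
        out.append(cache[key])
    return out
```

Because the key ignores which batch a statement arrived in, rerunning the prompt for statements 1-20 only pays for statements 11-20.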
🧮 Embeddings Cache
- Stores embeddings for text analysis, which we have to "buy" from OpenAI
- Used for similarity searches, clustering and magnets
- Faster and cheaper than ordering the embeddings again from OpenAI
⬆️ Uploads
- Not really a cache because you need to use it manually.
- You can use it to save intermediate results like the result of coding
- upload project|foo uploads the previous result under the key project|foo
- Add overwrite=true if there is an existing result under the same key
- Tip: delete the overwrite=true afterwards
Magic fences
Code within magic fences is completely ignored at all times UNLESS you click a row inside the fences, in which case everything runs as if the fence above was not there.
We should show a confirmation alert if someone clicks a gutter WITHIN a fenced block; they have to confirm before the gutter click is sent to the backend. This is a purely front-end trick.
Keys
- Each cached result has a "key" or "label" which we use to fetch the result from the cache
- For the local and remote 'lines' caches, the key is all the commands up to the current line concatenated into one long string, after deleting any blocks which are irrelevant (i.e. do not store anything we need). So if you change an irrelevant line above the current line, the key stays the same and we can re-use the cache (see the sketch below)
- For the prompt cache the key is the table row and the prompt
- For the embeddings cache the key is just the text for which we want the embeddings
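A minimal sketch of how such a lines-cache key could be assembled (the is_relevant test and the hashing choice are assumptions for illustration, not the backend's real key builder):

```python
import hashlib

def lines_cache_key(commands: list[str], is_relevant) -> str:
    """Concatenate all commands up to the current line, dropping blocks judged
    irrelevant, then hash; editing an irrelevant line leaves the key unchanged."""
    relevant = [c for c in commands if is_relevant(c)]
    return hashlib.sha256("\n".join(relevant).encode("utf-8")).hexdigest()

# e.g. with comment lines treated as irrelevant, these two keys are identical:
not_comment = lambda c: not c.strip().startswith("#")
assert (lines_cache_key(["get foobar", "# a note", "links"], not_comment)
        == lines_cache_key(["get foobar", "links"], not_comment))
```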
Clear Local Cache Button (with broom icon) ⚡
- Just deletes all local cache (named and lines results) but not remote.
Clear Current Line Cache Button (with eraser icon) ⚡
- Deletes local AND remote cache for the current line (the blue-grey line where the CURSOR is in the editor, not necessarily the same as the green last-clicked line). Useful e.g. for a line like
download ... when you have updated the upload in the interim. NEW: also works if the user makes a multi-line selection: for EACH line in the selection, delete local and remote cache.
Cache Toggle Button ⚡
Located next to the "Clear Cache" button:
- ON (default): Uses cached results when available (both line-based caching and named results from store/recall)
- OFF: Recalculates everything fresh, ignoring both the automatic line-based cache and the named results cache
- You would only ever use OFF if you think the old results are wrong. This can only happen when:
  - The app code has been updated
  - Your commands work on uploaded files or CM3 files that might have changed
  - You need to refresh stored/named results (from store/recall commands)
With the cache toggle OFF, the behavior for named results (e.g., from store xyz and recall xyz) is as follows:
- Any named results stored before the current processing run (or before the toggle was set to OFF) are ignored.
- However, if a named result is created by a store command during the current processing run (while the toggle is OFF), it is available for subsequent commands like recall or append within that same run. These freshly created named results are held temporarily in memory for the current run only.
- For line-based caching, when OFF, results are always recalculated (retrieval is skipped), but these fresh results are still saved to the cache (local and remote), overwriting existing entries.
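A reduced sketch of this toggle logic (execute, make_key and the cache objects are stand-ins for the real backend functions):

```python
def run_line(line, cache_on, line_cache, named_store, execute, make_key):
    """Sketch only: named_store starts empty for the run, so store/recall still
    work for results created during the run even while the toggle is OFF."""
    key = make_key(line)
    if cache_on:
        hit = line_cache.get(key)        # local first, then remote in the real app
        if hit is not None:
            return hit
    result = execute(line, named_store)  # store/recall read and write named_store
    line_cache[key] = result             # fresh results are saved even when OFF
    return result
```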
Tips 💡
- The cache automatically knows when to update if you change your commands
- Use "Clear Cache" if you want to start fresh
- Caching helps make your workflow faster and more efficient
Quick Start Guide
What is this app for?
This app helps you analyze text data using AI and create different kinds of outputs like causal maps, tables, charts, etc.
The Interface
The app has three main areas:
- Editor Panel (left): Where you type commands and text
- Output Panel (right): Shows results of your commands
- Gallery (within the right tab): Access your saved work and templates
You can resize the panels by dragging the dotted divider between them.
Basic Workflow
- Type commands in the editor panel
- Results appear in the output panel
- Commands are processed line by line - each line builds on the previous result
- No need to save the results: they are cached and reloaded quickly based on the text in the editor. You don't save the results, you just save the text.
Workflows is Verb-Centric
At the heart of Workflows is a set of verbs that describe kinds of data transformations that are applied to data-tables. These verbs can be chained together to form data workflows.
Types of Commands
Inputting tables (resulting in a table)
- links myfile # Load a links file from CM3
- statements myfile # Load a statements file from CM3
- download myfile # Load a table from Workflows and use it directly.
- scrape myURL # Scrape a URL and make its contents available as a table.
- fetch qualia # Get Qualia interview transcripts
Outputting visualisations (inputting a table and outputting a visualisation)
- map # Create network visualization
- chart # Create a chart
- print # Formatted text and images
Transforming data (inputting a table and outputting a table)
All the other commands are like this, e.g.
- select col1 col2 # Choose specific columns
- filter col = value # Filter rows
- prompt # Use an LLM prompt to transform the 'text' column in the previous table and output a new table
Remember
- Commands are processed top to bottom
- Results are cached so they load quickly when reused
- Add comments with # to document your workflow and create a presentation which you can share
- Save versions of your work regularly
Getting Help
- Click the ? icon for an interactive tour or press alt-h for a help modal
- Check the syntax guide for command details
- Look at example templates in the Gallery
When the cursor is in a word in the editor and the user presses alt-h, the help modal appears with that word pasted into the search field.
Next Steps
Once you're comfortable with the basics, you can:
- Upload your own PDF files for analysis
- Use AI prompts to analyze text
- Create presentation slides
- Share your work with colleagues
- Export results to PDF
Keyboard shortcuts
- ArrowUp: Toggle the sidebar and main navbar visibility (ignored if typing in an input field).
- Left Arrow / Right Arrow: Navigate presentation slides (previous/next) only when the sidebar/navbar is hidden (ignored if typing in an input field).
Authorisation and authentication
The app uses Firebase authentication for login and logout, using a firebase_user_id.
Note the "sess_id" is the session-based sess_id, not the firebase_user_id.
Text/Workflow Authorisation
Access to texts (workflows, snippets) is now managed via the database, not the first line of the text file.
- Owner: Each text has an owner (an email address), set when the text is first saved. The owner always has full access.
- Editors: Each text has an editors field in the database (a comma-separated string of emails). Users listed as editors have access to the text.
- Viewers: Each text also has a viewers field (comma-separated emails). Users listed as viewers can load and view the text, but cannot save changes or manage editors/viewers.
  - You can add the special value global to the viewers list to make the text publicly viewable by anyone (logged in or not).
- Management:
  - The owner, administrators, and any current editors can manage the list of editors and viewers.
  - A "Manage Editors" pill and a "Manage Viewers" pill appear on the gallery card only for users with permission (owner, editor, admin).
  - Clicking these pills opens a modal to add or remove editor/viewer emails.
- Gallery/Loading: You will only see texts in the gallery or be able to load texts via URL if you are:
  - An administrator (defined in config.py),
  - The owner of the text,
  - Listed in the editors field for the text, or
  - Listed in the viewers field for the text, or global is in the viewers list.
- Saving/Editing: Only administrators, owners, or editors can save changes to a text (e.g., using the "Save Project Version" button or editing via the gallery modal).
- Ownerless Texts: Texts created before this system might have no owner assigned. These are only visible to administrators.
- DEPRECATED: The old method of using share editing: ... viewing: ... on the first line of the text is no longer supported.
Only provide file names for which the logged-in user is authorised, i.e. where their email appears in the comma-separated lists of emails in the edit, copy or view fields of the corresponding row in the files table.
Versions (applies to any text, not just Projects)
- Every text saved at the texts sql table can have one or more versions.
- Each version has a version number, starting from 1 for the first version, e.g. foo-v1.
- The version number is updated for each save.
- A text name with no version number is taken as a pointer to the latest version.
- This way, we can refer users to either the most recent version (which may constantly update) or a specific version (e.g. foo-v123, which is fixed).
Projects and Project versions
A Project aka workflow aka script is any text in the text database which calls data, see below.
special case: No title
The user may delete the title in #projectTitle, in which case there is no current Project until the user saves under some name.
Normal case
Project titles will often contain | separators, e.g. foldername|foo-v22. These are just strings, not alternatives or lists. The part before the separator is treated a bit like a folder name in the Gallery, but otherwise treat it like any other character.
A Project is a text saved in the texts sql table which also has a versioned title like foobar-v12. The current Project is always the (versioned) Project title in #projectTitle; the contents of the Editor start out as the corresponding text, which the user then edits and can save again under a new version number. Previously we also allowed unversioned titles as equivalent to a kind of version zero, but we don't want this: the first version should already be foobar-v1.
When the user presses the saveProjectVersion button:
- if projectTitle is empty, throw an error
- if projectTitle is not empty, whether or not it ends in -vNN where NN is an integer:
  - always find the max NN and increment by 1 to give the new version number
  - save the current text to the texts table with the new projectTitle-vNN as the title
  - copy the other fields (including colour fields and font field) apart from modified, which will auto update, and text_id, which will be updated to the new id
  - replace the contents of projectTitle with the new projectTitle-vNN
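A small sketch of this version-increment rule (the function name and inputs are illustrative, not the app's real save handler):

```python
import re

def next_versioned_title(title: str, existing_titles: list[str]) -> str:
    """Strip any trailing -vNN, find the max existing version for that base name,
    and return base-v(max+1)."""
    base = re.sub(r"-v\d+$", "", title.strip())
    if not base:
        raise ValueError("projectTitle is empty")
    versions = [int(m.group(1)) for t in existing_titles
                if (m := re.fullmatch(re.escape(base) + r"-v(\d+)", t))]
    return f"{base}-v{max(versions, default=0) + 1}"

# next_versioned_title("foo-v3", ["foo-v1", "foo-v2", "foo-v3"]) -> "foo-v4"
# next_versioned_title("bar", [])                                -> "bar-v1"
```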
Clicking Clear Title ($el.previousElementSibling.value = ''; projectTitle = '') should not only remove the title from #projectTitle but also from the URL ?title= parameter.
URL parameters and project versions
- The URL changes live if you change the projectTitle: &title=... but without the version number (&title=foo, not &title=foo-v123). It also always reflects the current line number, i.e. the most recently clicked line in the editor, in the &line= parameter.
- If you have a non-blank projectTitle, the URL will NOT be updated to reflect the editor text.
- If you have a blank projectTitle, the URL constantly updates as you alter the input text, so it reflects the editor text and the cursor position. If you reload or send the URL to someone else, they will see the same text and cursor position. If you have a non-blank projectTitle, the page will find the latest version of the text with that title and load it into the editor. If you have a versioned title, that specific version is loaded.
You can also specify the text, title (optionally with version number) and line number directly in the URL:
- title=... This is the title of the text in the database. If no version number is present, we take the latest version. The version is loaded into the projectTitle text input, and the corresponding text is loaded into the editor.
- text=... Only used if the title is not present.
- line=... This is the line number to move the cursor to and simulate a gutterclick on this line.
Tests:
- one
- two
- should load longer text from texts table into input and load testAll into projectTitle
Closing or refreshing browser
If the button saveProjectVersion has class needs-save (or similar) AND the #projectTitle is not blank, we should pop up the browser alert "Changes that you made may not be saved." .
Saving texts from the gallery
If we click editText(item) in gallery.js and then the saveEdit button, we get a new version. There is also another Save Without Versioning button next to it, which simply updates the existing text without creating a new version.
Clicking @click="saveProjectVersion" should save the current text to the texts table with the contents of #projectTitle as the title, updated to append -vNN where NN is the max of the existing versions plus one (get the title from #projectTitle, not the URL), and copy the other fields including colours and fonts, apart from modified (which will auto update) and text_id (which will be updated to the new id).
Editing texts from the gallery - should NOT make any changes to #projectTitle or the editor text. - should NO LONGER copy the URL with ?title=projectTitle-vNN&line=[current line number] to clipboard.
- if the content of projectTitle is empty, throw an error.
- if the content of projectTitle is new and has no -vNN, that's fine, just save as projectTitle-v1.
Gallery
In a separate html template, included in index.html, we have a gallery of favourite texts. This is a flex container with a card for each text, showing by default the first 30 cards, with filters and server-side population.
The cards are basically one for each row in the texts table, except that for projects (sets of texts like foo-v1, foo-v2, etc) we show the project title and only the latest version number.
A card is a project if it contains any lines starting with: get, fetch, download, links, statements, sources, scrape.
By default we show the most recent 30 unarchived texts.
Each card shows:
- the project title,
- the file name(s) (extracted from any 'get someFileName' command in the text),
- the number of lines,
- the first few characters of the first two lines,
- the date last modified,
- a checkbox showing and controlling the status of the archived field,
- a button to insert the corresponding text into the Ace text editor:
  - if the text is a project, insert the latest version, replacing all the text in the editor, and unset .needs-save
  - if the text is not a project, insert the text at the current cursor position, and set .needs-save
- a button to insert the corresponding title into the Ace text editor in square brackets,
- a button to open a modal to edit the text,
- widgets for colours and font.
Loading texts from the gallery
When we click @click="$store.galleryActions.insertText(item)", the editor text is replaced with the text from the template. #projectTitle should also be updated, and the URL should be updated only with this title, not also with the text parameter. The "needs save" attribute of the save button should be reset. NEW: also simulate a click on the output tab ({ activeRightTab: 1 }) to switch from Gallery to Output, and NEW: simulate a gutterclick on line 1.
NEW: do the same (switch tab to Output) when we click on $store.galleryActions.insertTitle(item)
Template Tag Version Updating
When you click on a template tag in square brackets (e.g., [mytemplate-v5]) in the editor, it opens the gallery edit modal for that template. If you make changes and save a new version, the system automatically updates the template tag in the editor to reflect the new version number.
How it works:
- Click on any template tag like [genai-labels-v25] in the editor
- The gallery modal opens with that template loaded
- Make your edits and click "Save Edit" to create a new version (e.g., genai-labels-v26)
- The original template tag in the editor is automatically updated to [genai-labels-v26]
- This works whether the gallery component is initialized or not - the system includes fallback mechanisms for template editing
This ensures your workflow always references the latest version of templates you've edited, without requiring manual updates to template tags throughout your code.
Search filter:
Cards should be sorted like this: first list cards with a text hit in the title, in descending order of modification date; then list cards with a text hit in the text, in descending order of modification date. Also take into account the checkbox to filter by archived status and a checkbox to show all versions or only the most recent version (default: most recent).
If the button saveProjectVersion has class needs-save (or similar) we should pop up the browser alert "Changes that you made may not be saved." as we would if closing the browser.
Project cards, which are normally blue, now get two colour selectors. Give me SQL to add the fields colour1, colour2 and font to the texts table. To project cards, add two new selectors for colour1 and colour2 which provide colours; these are fed from and saved to the texts table. Default is black. Show colour1 as the card colour and colour2 as the border.
Authorisation of texts
We add a widget to each card which allows us to directly edit the editors field in the sql table (this field exists already, varying(128)). Allow users with edit permission to add or change valid email addresses using the card. Use this field instead of the first line of text to decide whether cards are displayed in the gallery.
... then, in the gallery, via the endpoint, only list texts where the current user is an admin, the owner, or one of the editors.
Folders in the Gallery
folders are simply all the texts like foo for which there are texts with title foo|bar, foo|baz, etc.
so in the gallery, also print at the top a list of "folders", i.e. all such texts, in a flex container with a small button for each folder, which when clicked loads that text including the trailing | into the search filter.
That means if you want to put a text into a folder, you just type "foldername|" first. Afterwards, versions of the texts will remain in the same folder.
in the list of folders, we put the folders "examples" then "tools" first and make the icons green instead of blue
after the "show all versions" checkbox we add a radiogroup with options Scripts Snippets and Both, with default Scripts. If scripts, show only cards with files (the blue ones), if Snippets show only non-file cards (the whilte ones), etc
Parsing / processing the input text
The input text is processed line by line.
So line 1 produces a result, line 2 operates on that result, etc.
The results are cached locally, so they should appear quickly when used again.
The user should get hints on what to type, see below.
If the result of the current command is a table, the parser will return the first 50 rows and we will display them in a front-end Tabulator table with client-side operation. That means the full table is not loaded into memory, so be careful: sorting by headers only sorts the current page.
Desired pass-through behaviour (not implemented yet)
Atm most commands take an input table in @process_parse_request, process it and pass it on to the next command. As a byproduct they also send the output to the frontend as a table for visualisation.
The map and print commands are exceptions because they pass a different kind of output, and do not pass on a tabular result to the next command in the pipe. This is not convenient; I wish they could just pass the previous table untouched on to the next command, while still sending their special outputs to the frontend as before.
I think this would be easy to do by storing and passing on the previous table in the parse loop. suggest how to implement this.
magic fences
Magic fences are blocks within pairs of three backticks that are ignored. This is useful for including code in the input text without it being processed, for example when you have run a long and complicated prompt and want to then upload the result to the datastore.
For example, fenced lines like "ignore this" or "399950030" are skipped, so code changes within or before the magic fences are not processed.
But the magic is that if you click within the magic fences, the code will be processed as usual. That includes all code before the magic fences (as usual, respecting existing caching as usual, and all the preceding code within the magic fences as usual). The results are cached as usual, also for any subsequent clicks within the magic fences. This is because, if you think about it, if you click within the magic fences, the app can't see the closing magic fence backticks, so it doesn't treat this request as being from within magic fences at all.
Recipes
We introduce hard-coded recipes. When a user enters the app with no &text= or &title= parameter, we commence a frontend-based workflow: open the upload modal, which for now has two buttons, Outcome Harvesting and "Other". Clicking one activates a recipe workflow. For now, we only have one recipe: OH-PDF.
On clicking the OH button, we hardcode the upload filename to "[sanitised-user-email]|OH-PDF--nn" (nn is an incremental integer, to ensure the uploads name is unique) and disable editing this name.
- We also show a custom text within this modal which says "Upload all your PDFs at once and I will identify Outcomes within them".
- When the upload has finished, update the editor (using existing functionality to change editor text) like this:
list uploads
download OH-PDF-[sanitised-user-email]-nn
Output
Table output (default)
Numerical columns have conditional formatting of cell backgrounds based on cell value, with darker backgrounds for higher values: muted red for negative values, muted blue for positive values, transitioning gradually through white at zero, with square-root scaling. Colours are scaled such that the darkest colour is used for the largest absolute value. Scaling is on a per-column basis.
Numerical non-integer columns show at most 2 significant decimal places (so 2.1 is formatted as 2.1, 2.22 is formatted as 2.2, 2.256 is formatted as 2.26, but 0.001 is formatted as 0.001).
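A rough sketch of the square-root colour scaling; the exact RGB palette here is an assumption, only the scaling rule comes from the description above:

```python
import math

def cell_colour(value: float, col_abs_max: float) -> str:
    """Square-root intensity scaling per column: white at zero, blue-ish for
    positive, red-ish for negative, darkest at the column's largest |value|."""
    if col_abs_max == 0:
        return "rgb(255,255,255)"
    intensity = math.sqrt(min(abs(value) / col_abs_max, 1.0))   # 0..1
    fade = int(255 * (1 - intensity))                           # 255 = white
    return f"rgb({fade},{fade},255)" if value >= 0 else f"rgb(255,{fade},{fade})"

# cell_colour(0, 10)   -> "rgb(255,255,255)"  (zero is white)
# cell_colour(-10, 10) -> "rgb(255,0,0)"      (largest negative gets the darkest red)
```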
text results and outputs
I want to unify the different kinds of text results and Output, for example I want:
Commands and syntax
Getting data, data management
⚡ list mongo OR uploads OR cm3
ignoring case, so list uploads is the same as list Uploads, list MonGO is same as list mongo, list cM3 is same as list CM3 etc
It is just a convenience. You can use the other commands e.g. to filter this list, but it is not part of the workflow / pipeline.
list uploads [cache=False]
This loads a table of all the objects in the S3 store saved with the upload command and retrievable with the download command, along with any metadata we can get from the S3 store. Initially listed in descending order of upload date.
NEW by default, we do NOT list lines in the "directory" cache/lines_cache/, unless cache=True
list cm3
an additional standalone command on any line:
This loads the complete list of cm3 files, analogous to the get command.
You can use the other commands e.g. to filter this list, but it is not part of the workflow / pipeline.
⚡ scrape foobar.com [depth=nn]
Uses the package scrapy to scrape the web page at foobar.com with minimal extra parameters and load the data into a dataframe called statements, analogously to "get".
depth parameter (default 0):
- depth=0: Only scrape the initial URL (default behavior)
- depth=1: Follow all links on the page which are in the same (sub-)domain and scrape those too
- depth=2: Continue this process for one more level deep
The command adds columns for depth (showing the level at which the content was found) and url (showing the specific URL where each piece of content was scraped from).
The scraper provides notify_info feedback showing scraping progress. If the number of followed links exceeds 200, the process stops with a notify_warning to prevent excessive resource usage.
The scraper only follows links within the same domain or subdomains (e.g., if you start with example.com, it will follow links to subdomain.example.com but not to other-site.com).
Example: scrape example.com depth=2 will scrape the main page, all linked pages on the same domain, and then all pages linked from those pages.
This prevents several issues:
- Circular references: if page A links to page B, and page B links back to page A, we won't get stuck in an infinite loop
- Duplicate content: each URL is only scraped once, even if multiple pages link to it
- Multiple paths to the same page: if the same page can be reached via different link paths, we only process it once
The implementation uses a set() to track processed URLs, which provides O(1) lookup time for checking duplicates. This is efficient even when processing the maximum of 200 URLs. So the scraper correctly handles scenarios like:
- A → B, B → A (circular)
- A → C, B → C (multiple paths to the same destination)
- Complex link networks where pages reference each other
The depth tracking still works correctly because we track the depth at which each URL was first encountered, ensuring the depth column accurately reflects how the content was discovered. (A sketch of this logic is shown below.)
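Here is a compact sketch of that dedup-plus-depth logic; fetch_links is a stand-in for the scrapy machinery, and a dict is used so the first-seen depth is stored alongside the deduplication set:

```python
from urllib.parse import urlparse

def crawl(start_url, fetch_links, max_depth=0, max_urls=200):
    """fetch_links(url) is assumed to return the absolute links found on a page;
    the real command uses scrapy and notify_info/notify_warning for feedback."""
    if "://" not in start_url:
        start_url = "http://" + start_url
    domain = urlparse(start_url).netloc
    seen = {start_url: 0}                 # url -> depth at which it was first found
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        depth = seen[url]
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            host = urlparse(link).netloc
            same_site = host == domain or host.endswith("." + domain)
            if same_site and link not in seen:    # circular refs and duplicates skipped
                if len(seen) >= max_urls:
                    return seen                   # the real app warns and stops at 200
                seen[link] = depth + 1            # depth of first encounter
                queue.append(link)
    return seen
```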
⚡ get
get foobar: if the file foobar exists, it will be loaded as long as it is shared with the user.
links, statements and sources are the names of the three tables making up a file.
The three tables are loaded and saved locally in memory to a datastore, ready to be used.
joins
At the end of load_file_data(), left join the sources table to the statements table by source_id. This must not result in extra rows, as source_id must be unique in the sources table. If there are any other duplicate colnames, throw a notify_warning() and leave those columns out of sources.
Finally, left join the statements table to the links table by statement_id. This must not result in extra rows, as statement_id must be unique in the statements table. Again, skip duplicate columns.
If there are no columns to join on, create all combinations. So if x is a df with m rows and y is a df with n rows, the result will have m*n rows and all combinations of rows from x and y.
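A pandas sketch of these joins, assuming the three tables arrive as DataFrames (join_file_tables and the notify_warning default are illustrative names, not the app's real functions):

```python
import pandas as pd

def join_file_tables(links, statements, sources, notify_warning=print):
    """Left-join sources into statements, then statements into links."""
    # sources -> statements on source_id, dropping duplicate columns from sources
    dupes = (set(statements.columns) & set(sources.columns)) - {"source_id"}
    if dupes:
        notify_warning(f"dropping duplicate columns from sources: {sorted(dupes)}")
    statements = statements.merge(sources.drop(columns=list(dupes)),
                                  on="source_id", how="left", validate="many_to_one")
    # statements -> links on statement_id, again skipping duplicate columns
    dupes = (set(links.columns) & set(statements.columns)) - {"statement_id"}
    links = links.merge(statements.drop(columns=list(dupes)),
                        on="statement_id", how="left", validate="many_to_one")
    return links

# With no columns in common, a cross join gives all m*n combinations:
# x.merge(y, how="cross")
```

The validate="many_to_one" argument makes pandas raise if the right-hand key is not unique, matching the "must not result in extra rows" requirement.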
⚡ links statements sources
The three dataframe commands: links, statements and sources, as well as the files command, can be used anywhere in the input and will load the dataframe with that name from the datastore.
These three commands are used on their own, and the result is the table with that name.
- links foobar: if the cm3 file foobar exists, this returns the links table as a dataframe.
- statements foobar: if the cm3 file foobar exists, this returns the statements table as a dataframe.
- sources foobar: if the cm3 file foobar exists, this returns the sources table as a dataframe.
links, statements and sources are the names of the three tables making up a CM3 file.
We add this convenience functionality to the parser:
If the command is just links, statements or sources, then we look back to see if the actual cm3 file has been mentioned earlier in the input in a links, statements or sources or get command. If so, we use that, otherwise we give an error. If there is more than one such mention of an actual cm3 file, we take the most recent one.
So

get foobar
blah blah blah
links

is the same as

get foobar
blah blah blah
links foobar

and

statements foobar
blah blah blah
links

is the same as

statements foobar
blah blah blah
links foobar

and

get foobar
blah blah blah
statements baz
links

is the same as

get foobar
blah blah blah
statements baz
links baz
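A minimal sketch of this look-back rule (resolve_file is a hypothetical helper, not the parser's real function):

```python
def resolve_file(lines, current_index):
    """Scan earlier lines for the most recent links/statements/sources/get
    command that names an actual file, and return that file name."""
    commands = ("links", "statements", "sources", "get")
    for line in reversed(lines[:current_index]):
        parts = line.split()
        if parts and parts[0] in commands and len(parts) > 1:
            return parts[1]                      # most recent explicit file name
    raise ValueError("no cm3 file mentioned earlier in the input")

# resolve_file(["get foobar", "blah", "links"], 2)                      -> "foobar"
# resolve_file(["get foobar", "statements baz", "links"], 2)            -> "baz"
```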
⚡ factors
Factors are not a table like links, statements and sources, but are actually a view of the links table: what are the unique values of from_label and to_label? So the command only works if the current table has columns called from_label and to_label and source_id and link_id.
It outputs a table with columns
- label,
- source_count,
- link_count
- source_ids
- link_ids
- statement_ids
It takes the from_label and to_label columns and combines them into a single column called label, one on top of the other, then groups by label, counting the number of source_id and link_id rows for each.
It also adds the columns source_ids and link_ids and statement_ids which are the unique values of source_id and link_id for each group.
If the column from_label is not available, use Cause or cause if available. Similarly for to_label: if not available, use Effect or effect if available. And if statement_id is not available, create a serial number but as character. And if source_id is not available, use statement_id.
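A pandas sketch of the basic factors view, ignoring the column-fallback rules just described; whether counts are of rows or of unique ids is an assumption here (unique ids are used):

```python
import pandas as pd

def factors(links: pd.DataFrame) -> pd.DataFrame:
    """Stack from_label and to_label into one `label` column, then count and
    collect ids per label."""
    cols = ["source_id", "link_id", "statement_id"]
    stacked = pd.concat([
        links[["from_label"] + cols].rename(columns={"from_label": "label"}),
        links[["to_label"] + cols].rename(columns={"to_label": "label"}),
    ], ignore_index=True)
    return (stacked.groupby("label")
                   .agg(source_count=("source_id", "nunique"),
                        link_count=("link_id", "nunique"),
                        source_ids=("source_id", lambda s: sorted(s.unique())),
                        link_ids=("link_id", lambda s: sorted(s.unique())),
                        statement_ids=("statement_id", lambda s: sorted(s.unique())))
                   .reset_index())
```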
fetch qualia [table=statements] foobar
This is a convenience command that loads the file foobar as a statements dataframe from a different RDS database:
conn = await get_db_conn()
we need the tables called "interviews" (for auth information) and "answers" (for the data) in the database.
It's a postgresql database at heroku.
fetch qualia foobar
loads the answers table where the field interview=foobar as a dataframe, provided the user has the correct auth information in the interviews table: the current user's email must equal the owner field of the corresponding row in the interviews table, or the edit field (a comma-separated list of emails) must contain the user's email.
if the table is not statements, then fetch the appropriate table from the database:
- sessions
- interviews
- translations
⚡ share
share and unshare are used to control access to objects (usually tables) stored at S3. These objects also have a key called owner_email which is used to control access; owner_email is set to the current user's email when the object is uploaded and cannot then be changed. Only owners, editors or admins can access the object or change the edit field.
share foobar friend@example.com[, friend2@example.com]
this command will add the friend's email(s) to the edit field of the file foobar (comma separated list of emails).
Only editors or admins can do share or unshare.
⚡ unshare
unshare foobar friend@example.com[, friend2@example.com]
this command will remove the friend's email(s) from the edit field of the file foobar (comma separated list of emails).
⚡ upload objectname [overwrite=true]
Very simple command to upload the previous table or other object to the S3 store, with the name objectname.
only if overwrite=false will it fail if the name is already used, otherwise it will silently overwrite an existing object.
uploading with overwrite=true will only work if the current user is the owner or an editor of the object.
the upload command also adds the current user's email to the object's owner metadata.
⚡ download objectname
very simple command to download an object from the S3 store, with name objectname. For permissions, see share command above.
⚡ delete [objectname] OR [N=nn]
If objectname is given, and the current user is the owner or an editor of the object, or an admin, delete the object from the S3 store (a version is still saved in the S3 versioning system).
if no objectname is given, read the Key column of the preceding (presumably filtered) table and delete each of those objects from the S3 store for which the current user is authorised as above.
HOWEVER, when using a table, we must also provide an integer N which is the number of rows in the table to delete (i.e. not the row numbers but the total number of Keys being passed). If this number is wrong, the whole command will fail. This is a way to make sure you know what you are deleting!!
i.e.:
- IF N=nn is given, look for a preceding table and if it has nn rows delete each of the keys in the 'Key' column.
- IF no N is given, but an objectname not beginning with "N=" is given, delete that key.
- else, throw an error.
⚡ store xyz / recall aka unstore xyz
(recall is the same as unstore.)
store: This command goes on its own line. It simply means that the backend adds the current result (the result of the previous command) to the in-session cache with the tag xyz.
Then when it sees code like this:
[other commands]
command1
store xyz
[possibly other commands]
command2
recall xyz
command3

that means that command3 is run with the result of command1, and the output of the last line is the same as if we had written:

command1
command3
they should not persist independently of the command pipeline.
foo
store foostore
bar
recall foostore
if foo changes, the result of recall foostore should change.
Dataframe manipulation: Filter, select, etc
⚡ groupify
store baz # some previous result
store bar # some previous result
foo-df # df on preceding line
groupify bar baz # NEW SYNTAX a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
If there is no preceding df, like this
store baz # some previous result
store bar # some previous result
print # the result of this is not a df
groupify bar baz # NEW SYNTAX a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
(old syntax was groupify(baz,bar) rather than new: groupify baz bar)
⚡ ungroupify
Extracts a numbered element from a grouped output created by group by or groupify.
store baz # some previous result
store bar # some previous result
foo-df # df on preceding line
groupify(bar,baz) # a grouped output consisting of three objects foo-df, bar and baz just like the result of the `group by` command
ungroupify 2 # returns baz
⚡ select
selects columns from the current table. eg:
get foobar
statements
select colname1 colname2 colname3
We also allow selection by position and range. So if the colnames are a b c d e f g h i j:
select 2-4 e f 8
will select columns b c d e f h.
Note that the select statement also reorders columns into the given order. So:
select bar foo baz
will return a table with columns in the order bar, foo, baz.
Select also allows renaming columns. So:
select colname1 into foo colname2 into bar
will return a table with columns foo and bar.
This also supports quotes in both positions:
select "colname1" into foo colname2 into bar
will return a table with columns foo and bar.
select colname1 into "foo" colname2 into "bar with space"
will return a table with columns foo and "bar with space".
Also allow minus to exclude columns:
select -1-3
... returning all the columns except the first three.
and
select -colname1 -colname2
... returning all the columns except colname1 and colname2.
⚡ index
Also enable the index command:
index foo by bar
All this does is add the name and value of the grouping column to the output. Given a row with bar=i1, foo=c1, the output will be:
((bar: i1)): c1
⚡ rename
rename colname1 into foo colname2 into bar
renames the columns colname1 and colname2 to foo and bar respectively.
Like the select command, this also supports quotes in both positions:
rename "colname1" into foo colname2 into bar
and
rename colname1 into "foo" colname2 into "bar with space"
etc
NEW enable numerical col specification like in select e.g.
rename 1 into foobar
⚡ union newcolname = oldcolname1 oldcolname2 ...
This takes the current table with columns including oldcolname1, oldcolname2, ... and replaces those columns with a new column called newcolname with the concatenated values of oldcolname1, oldcolname2, retaining existing columns not mentioned.
The concatenation is done like this: given a row with oldcolname1 = value1 and oldcolname2 = value2, the new column contains:
newcolname = ((oldcolname1:)) value1\n((oldcolname2:)) value2
⚡ split column by charstring
Returns the same df except that for any row where charstring appears in column n times, we split that text by charstring into n+1 pieces, resulting in n+1 rows which are identical except that column contains each of the n+1 stripped pieces.
eg
split text by //
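A pandas sketch of this behaviour (split_column is an illustrative name, not the app's real implementation):

```python
import pandas as pd

def split_column(df: pd.DataFrame, column: str, charstring: str) -> pd.DataFrame:
    """One output row per stripped piece, with all other columns duplicated."""
    out = df.copy()
    out[column] = out[column].astype(str).str.split(charstring)
    out = out.explode(column, ignore_index=True)
    out[column] = out[column].str.strip()
    return out

# split_column(df, "text", "//") mirrors:  split text by //
```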
⚡ distinct
inputs a df and outputs completely distinct rows from the df
Returns only unique rows from the current table, removing any duplicates. Usage:
get foobar
statements
distinct
This command takes no parameters and simply removes duplicate rows from the current table.
⚡ sort colname1 [colname2...]
Sorts the current table by the columns colname1, colname2, etc.
⚡ sortdown colname1 [colname2...]
Sorts the current table by the columns colname1, colname2, etc, in descending order.
⚡ pivot colname1 by colname2 [show colname3]
Pivots the current table so that colname1 is in the rows, colname2 is in the columns, and the value in each cell is the value of colname3.
If the combinations of the values of colname1 and colname2 are not unique, the values of colname3 are concatenated, separated by ////.
If show colname3 is not specified (most useful when the combinations of the values of colname1 and colname2 are not unique), the value in each cell is just the number of combinations.
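A pandas sketch of both cases (pivot here is an illustrative helper, and the //// separator is taken literally from the description above):

```python
import pandas as pd

def pivot(df: pd.DataFrame, colname1: str, colname2: str, show: str = None) -> pd.DataFrame:
    if show is None:
        # no `show`: each cell is the number of combinations
        return pd.crosstab(df[colname1], df[colname2]).reset_index()
    # with `show`: non-unique combinations are concatenated with ////
    return (df.pivot_table(index=colname1, columns=colname2, values=show,
                           aggfunc=lambda s: "////".join(s.astype(str)))
              .reset_index())
```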
⚡ filter
contains, is
Filter for rows where the column contains or is the value.
= is a synonym for is. It does not need to be surrounded by spaces.
more, less
Operators > and < aka 'more' and 'less'. Treat values as numbers if possible, otherwise filter always excludes the row.
OR works as expected.
< and > do not need to be surrounded by spaces.
OR
We also generalise the filter command: 'filter colname is foo OR bar' filters for colname to be EITHER foo OR bar, and 'filter colname contains foo OR bar' filters for colname to EITHER contain foo OR bar.
(note that the OR is in uppercase)
contains_not, is_not
Add operators contains_not and is_not. In this case we filter for rows where the column does NOT contain or is NOT the value.
In these cases, "contains_not foo NOR bar" retains rows where the column contains neither foo nor bar.
"foo bar baz" is discarded.
"foo baz" is not discarded.
m-n, n
'filter m-n' means retain only rows m-n of the current table.
NEW: 'filter n' means retain only the nth row of the current table.
random n
NEW: 'filter random n' means retain only a random sample of n rows from the current table. use a seed to make it reproducible.
quoted values
Added proper quoted value parsing that:
- Preserves spaces within quotes
- Handles OR correctly even with quoted values
- Ignores quotes and OR within quotes
- Strips whitespace from values appropriately
- Handles empty or invalid values gracefully
- Preserves case within quotes but does case-insensitive comparison
Now these will all work correctly:
links filter text contains "a long text with spaces"
statements filter statement_id contains "id with spaces"
statements filter text contains "first text" OR "second text"
links filter from_label contains "label one" OR "label two" OR "label three"
New parameter:
filter colname resembles "a text with spaces" [similarity=0.3]
filter colname resembles keyword [similarity=0.3]
example: filter text resembles "a text with spaces"
resembles
filter from_label resembles family similarity=0.5
This returns rows where the colname column resembles the keyword with the given similarity threshold, using the cosine similarity metric. So this needs to fetch the embeddings for the keyword and for the colname column, and then compare the two vectors using cosine similarity. We will use the new OpenAI text-embedding-3-large model for the embeddings. The request is like this:
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-large"
)
print(response.data[0].embedding)
I also want to cache the embeddings in the datastore, so that we don't have to fetch them from OpenAI every time. I already have MongoDB Atlas running.
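A reduced sketch of resembles with such a cache; an in-memory dict stands in for the MongoDB-backed embeddings cache, and the helper names embed and resembles are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()                # needs OPENAI_API_KEY in the environment
_embedding_cache = {}            # stand-in for the MongoDB Atlas embeddings cache

def embed(text: str) -> np.ndarray:
    if text not in _embedding_cache:
        resp = client.embeddings.create(input=text, model="text-embedding-3-large")
        _embedding_cache[text] = np.array(resp.data[0].embedding)
    return _embedding_cache[text]

def resembles(df, colname: str, keyword: str, similarity: float = 0.3):
    """Keep rows whose colname text is cosine-similar enough to the keyword."""
    target = embed(keyword)
    def cosine(text) -> float:
        v = embed(str(text))
        return float(np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target)))
    return df[df[colname].map(cosine) >= similarity]
```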
top
links
filter top n links [type = source]
filter top n factors [type = source]
type is optional, default is source.
filter top n links type = source means: get the current links table, group by bundle, order by source_count in descending order and return the top n rows. If there are ties, include all ties. This means that if bundle A >> B has 10 sources and bundle C >> D has 10 sources, and n is 10, then both A >> B and C >> D will be included, even if this means identifying more than n rows. then return all links from the current links table which are in these bundles.
filter top n links type = links Same as above, but use link_count instead of source_count.
factors
filter top n factors [type = source]
type is optional, default is source.
filter top n factors type = source means: get the current links table, create a factors table from it, listing unique labels only, group by label, order by source_count in descending order and return the top n rows. If there are ties, include all ties. This means that if label A has 10 sources and label C has 10 sources, and n is 10, then both A and C will be included, even if this means identifying more than n rows. then return all links from the current links table where BOTH from_label AND to_label are in this list.
filter top n factors type = links Same as above, but use link_count instead of source_count.
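A pandas sketch of the links case, assuming a "bundle" is a from_label/to_label pair (filter_top_links is an illustrative name; keep="all" is what includes the ties):

```python
import pandas as pd

def filter_top_links(links: pd.DataFrame, n: int, type: str = "source") -> pd.DataFrame:
    """Rank bundles by source_count (or link_count), keep the top n including
    ties, then return all links belonging to those bundles."""
    count_col = "source_count" if type == "source" else "link_count"
    bundles = (links.groupby(["from_label", "to_label"])
                    .agg(source_count=("source_id", "nunique"),
                         link_count=("link_id", "nunique"))
                    .reset_index())
    top = bundles.nlargest(n, count_col, keep="all")      # keep="all" includes ties
    return links.merge(top[["from_label", "to_label"]], on=["from_label", "to_label"])
```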
UPDATED implicit support for reckon (replacing any previous support for calculate)
beyond the simple cases shown previously, we also use the reckon command to filter for more complex conditions.
filter source_count==link_count
filter source_count<=(link_count+1)
filter contains(as_character(id,"x))
etc.
⚡ join [table]
left join the current table to the table specified by the table= parameter. this other table can be one of links, statements, or sources, or it can be a table in the store set up by the store command. the two tables must have exactly one column in common, which will be used to match rows. the result is a new table with the columns of the current table plus the columns of the table specified by the table= parameter.
⚡ append
this can append text to text or tables to tables, so we no longer need the table= or text= keyword.
If one is a table and one is text, give a notify_error() and return an error.
if both are tables:
append the current table to the table specified. this other table can be any table in the store set up by the store command.
the columns of the new table will be the intersection of the columns of the two tables.
- if column names only differ by case, they are considered identical and the actual column name from the stored table is used. A notification is given to the user.
- if two columns have the same name but differ in type, the type of the stored table is used. A notification is given to the user.
if both are text:
append the current text to the text specified. this other text can be any text in the store set up by the store command.
The result is just the text concatenation of the two texts, with each text preceded by a markdown h1 heading: part1 for the current text, then part2 for the stored text. If part2 is not given, the name of the stored result is used instead; if part1 or part2 is still missing, the texts "Part 1" and "Part 2" are used respectively.
optionally we can specify names for the json:
append stored-foo name1=Name for the current text name2=Name for the stored text
example
links example-file
filter 1-10
recalculate
store lkss
factors
print
store printfactors
unstore lkss
print
append name1=printfactors name2=Some name
this outputs a text which is the concatenation of the two printed versions of the two tables, with the initial heading "# Some name" and the second heading "# printfactors". Note quote marks are not required, but are tolerated and stripped.
⚡ reckon
similar to calculate.
All of these work.
All these operations work because the reckon command includes all NumPy functions and Python's arithmetic operators in its evaluation context, making it very flexible for mathematical calculations.
Column-wise
reckon max_row = max(row_number)
reckon avg_length = mean(label_length)
reckon total_rows = count(row_number)
reckon unique = count_unique(from_label)
reckon lister = list(from_label)
reckon spread = spread(from_label)
reckon baseline = baseline(from_label)
- sum(column): Sum of values
- mean(column): Average of values
- min(column), max(column): Minimum/maximum value
- median(column): Median value
- first(column), last(column): First/last value in current sort order
- count_unique(column): Count of unique values
- lines(column): Double-newline-separated string of values
- list_unique(column): Comma-separated string of unique values
Conveniences
reckon row_num = row_number
reckon nrow = tally # just an alias for `reckon N = count(row_number)`
row-wise
Basic arithmetic
reckon double = row_number * 2
reckon square = row_number ** 2
reckon increment = row_number + 1
reckon half = row_number / 2
reckon multi = (double+square)/(row_number*2)
String functions
reckon upper_label = upper(from_label)
reckon lower_label = lower(from_label)
reckon label_length = length(from_label)
reckon contains_a = contains(from_label, "a")
reckon starts_with_t = startswith(from_label, "F")
reckon matches_pattern = match(column_name, "^regex_pattern.*") # produces the actual matching strings, not a boolean
reckon has_pattern = contains(column_name, "regex_pattern", regex=True)
reckon match = match(from_label, "F.*")
reckon alphanum = match(column_name, "[a-zA-Z0-9]+") # Extract any sequence of letters/numbers
reckon letters_only = match(column_name, "[a-zA-Z]+") # Extract only letters
reckon numbers_only = match(column_name, "\\d+") # Use double backslash for digit char class
reckon digits_only = match(column_name, "[0-9]+") # Alternative without backslash
String manipulation
reckon replaced = replace(from_label, "a", "X")
reckon first_char = substring(from_label, 0, 1)
reckon without_first = substring(from_label, 1)
reckon three_chars = substring(from_label, 0, 2)
Combination of operations
reckon complex = upper(substring(from_label, 0, 1)) + lower(substring(from_label, 1))
reckon ratio = length(from_label) / row_number
reckon is_special = length(from_label) > 10 and contains(from_label, "I")
Boolean operations
reckon exact_match = from_label == "Income"
reckon not_match = from_label != "Income"
reckon is_large = row_number > 10
reckon is_small = row_number < 5
reckon is_medium2 = row_number >= 5 or is_special
reckon is_medium3 = row_number >= 5 and row_number <= 17
Negation (using NOT)
reckon is_not_large = not (row_number > 10)
reckon doesnt_contain = not contains(from_label, "test")
reckon isnt_empty = not (length(from_label) == 0)
Using numpy functions
reckon log_value = log(row_number)
reckon sqrt_value = sqrt(row_number)
reckon rounded = round(row_number / 3, 1)
reckon corr = correlation(row_number, sqrt_value) # returns columns corr_coefficient, corr_sd, corr_p
Type conversions
reckon as_string = as_character(row_number)
reckon numeric_length = as_numeric(length(from_label))
Conditional operations.
reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium", else "Large"
reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
Seems like this successfully leaves unmentioned values of row_number untouched?:
reckon row_number = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
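For reference, this conditional syntax maps naturally onto np.select and pandas mask; a sketch under that assumption (not the app's actual evaluator):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"row_number": range(1, 13)})

# reckon size_category = row_number < 5 ~ "Small", row_number < 10 ~ "Medium", else "Large"
df["size_category"] = np.select(
    [df["row_number"] < 5, df["row_number"] < 10],   # conditions checked in order
    ["Small", "Medium"],
    default="Large")                                  # the `else` branch

# reckon row_number = row_number < 5 ~ "Small", row_number < 10 ~ "Medium"
# with no `else`, unmatched rows keep their original value; applying the branches
# in reverse order gives first-match-wins semantics:
col = df["row_number"]
df["row_number"] = (col.astype(object)
                       .mask(col < 10, "Medium")
                       .mask(col < 5, "Small"))
```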
More maths
reckon remainder = row_number % 2 # Basic modulo
reckon is_even = (row_number % 2) == 0 # Check if a number is even
Exponentiation
reckon cubed = row_number ** 3 # Cube of row_number
reckon power = pow(row_number, 3) # Using pow function
Square root and other roots
reckon square_root = sqrt(row_number) # Square root
reckon cube_root = row_number ** (1/3) # Cube root
Logarithms
reckon log_value = log(row_number) # Natural logarithm
reckon log10_value = log10(row_number) # Base-10 logarithm
Rounding
reckon rounded = round(row_number / 3, 2) # Round to 2 decimal places
reckon floor_value = floor(row_number / 3) # Round down
reckon ceil_value = ceil(row_number / 3) # Round up
Absolute value
reckon absolute = abs(row_number - 3) # Absolute value
Trigonometric functions
reckon sine = sin(row_number) # Sine
reckon cosine = cos(row_number) # Cosine
reckon tangent = tan(row_number) # Tangent
Available column-wise functions requiring a column argument:
Special column-wise function: differences [sig=nn] [column=source_id] [arrows=false] [proportions=false] [baseline=false] [type=nominal]
This is a special case which returns multiple columns
The column parameter is optional and defaults to source_id.
NOTE THAT ATM IF YOU USE A COLUMN OTHER THAN source_id, THE BASELINE IS RECALCULATED FOR THAT COLUMN AT THIS POINT.
default significance level is 0.05.
If there is a baseline column, then for each source_id, add a column named like source_id_difference.
The values in each column are the same for every row: it is used in combination with the group by command.
we basically construct a chi-square table with 2 rows and len_unique(source_id) columns, plus margins, in which the column margins are the baseline counts for each source_id. row 1 of the chisq table is the current link counts for each source_id (how many links in the current table for each source_id). The other row is the difference between the current link counts and the baseline counts.
then we return a table which is the current table with new columns, len_unique(source_id) columns. IF the overall chi-square value is < than the significance level, return row 1 of the chisq table, i.e. the current link counts for each source_id, same for each row. if it is not sig, return NA in all these new columns.
Even when a row IS overall significant, only return values if the one-dimensional chi-square value for that particular source_id is also significant at the same significance level, otherwise return NA.
Also return a column called p which is the p-value for the overall table.
So this is a special kind of calculation keyword because it returns multiple new columns and there is no slot for the name of the returned column. So a normal command is e.g. like this:
calculate new = function(old)
whereas here we do:
calculate differences [sig=nn]
The parameter arrows=true means that, where we would return a value, we prepend it with an upward-right arrow if the value is larger than expected, and a downward-right arrow if the value is smaller than expected.
the parameter proportions=true means that, where we would return a value, we return the actual value divided by the expected value, rounded to 2 decimal places.
the parameter baseline=true means that, where we would return a value, we return the baseline value after the actual value (or proportion) in brackets:
7 (10)
or
0.33 (10)
or
↗ 7 (10)
or
↗ 0.33 (10)
etc
Finally, we also return a column called differences_label which is a comma-separated list of the source_ids which are significant, together with their values as above. Then prepend it with the actual overall link count, e.g., depending on the parameters, it might be:
44 :: source_id1: ↗ 7 (10), source_id2: ↗ 15 (20), source_id3: ↘ 0 (10)
or if there are no significant differences:
44
NEW: add the keyword type, default nominal. If it is ordinal, assume the grouping variable recorded in the baseline is ordinal, not nominal, and adjust the chi-sq test accordingly (should be a bit more powerful).
Usage examples:
reckon differences sig=0.05 type=nominal # Standard chi-square test (default)
reckon differences sig=0.05 type=ordinal # Cochran-Armitage trend test for ordinal data
reckon differences sig=0.01 column=source_id type=ordinal arrows=true
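A very reduced sketch of the overall test only (illustrative names; it ignores the arrows/proportions/baseline display options and the ordinal variant, and assumes the counts arrive as pandas Series indexed by source_id):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, chisquare

def differences(current_counts: pd.Series, baseline_counts: pd.Series, sig: float = 0.05):
    """Row 1 = current link counts per source_id, row 2 = baseline minus current,
    so the column margins equal the baseline counts."""
    baseline_counts = baseline_counts.reindex(current_counts.index)
    table = np.vstack([current_counts.to_numpy(),
                       (baseline_counts - current_counts).to_numpy()])
    chi2, p, dof, expected = chi2_contingency(table)
    if p >= sig:                                   # overall test not significant: all NA
        return pd.Series(np.nan, index=current_counts.index), p
    out = {}
    for i, sid in enumerate(current_counts.index):
        _, p_col = chisquare(table[:, i], f_exp=expected[:, i])   # per-column check
        out[sid] = current_counts[sid] if p_col < sig else np.nan
    return pd.Series(out), p
```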
Row-wise Operations (result is different for each row in the column, most useful in combination with calculate)
Conversion:
calculate foo = as_numeric(column) # Convert to numeric
calculate foo = as_date(column) # Convert to date, using smart date parsing
calculate foo = as_datetime(column) # Convert to datetime, using smart datetime parsing
calculate foo = as_character(column) # Convert to character
Boolean:
calculate asdf = foo OR bar # Outputs a boolean, inputs boolean or coerces to boolean
calculate asdf = foo AND bar
calculate asdf = (foo AND bar OR baz)
- etc
Basic arithmetic operations work row by row:
calculate square = subtotal ** 2
calculate total = price * quantity
calculate ratio = value1 / value2
calculate adjusted = (base + offset) * multiplier
calculate same = value1 is value2
calculate same = value1 == value2
calculate same = value1 is "value with spaces"
calculate different = value1 is not value2
calculate more = value1 > value2
calculate more = value1 more value2
calculate less = value1 < value2
calculate less = value1 less value2
calculate contains = value1 contains value2
calculate contains = value1 contains "value with spaces"
calculate not_contains = value1 does not contain value2
calculate not_contains = value1 does not contain "value with spaces"
calculate row_number = row_number # create sequential row number starting from 1
calculate substring = substring(column, start, length)
calculate fixedvalue = xyz # where xyz is any fixed value including a string or a number where these are not names of columns.
e.g.
calculate section = "John Smith"
calculate section = John
calculate prefix = substring(id,1,2)
and this switch/case_when syntax:
calculate chapter = page > 0 ~ chapter1, page > 10 ~ chapter2, page > 20 ~ chapter3, else chapter4
String concatenation using union():
calculate full_name = union(first_name, " ", last_name)
calculate path = union("prefix_", id, "_", name)
calculate display = union(code, " - ", description)
replace: Replace a (all instances of) substring with a new substring.
eg
calculate spaceless = replace(spacey, " ", "")
Use replace() to substitute text in strings:
calculate spaceless = replace(spacey, " ", "")
calculate fixed = replace(text, "old", "new")
calculate clean_id = replace(id_col, "-", "_")
calculate normalized = replace(category, "Category ", "")
calculate fixed_names = replace(name, wrong_text, correct_text)
calculate clean_dates = replace(dates, "/", "-")
The replace function takes 3 arguments:
- Text to modify (can be a column name or quoted string)
- Text to find (can be a column name or quoted string)
- Text to replace with (can be a column name or quoted string)
calculate dynamic = replace(text_col, old_col, new_col)
calculate fixed = replace("some text", error_col, fix_col)
Add support for fuzzy matching:
calculate col = replace(col, "Birmingham", "Birmingham", fuzzy=0.5)
The fuzzy=0.5 argument is optional and defaults to 0, i.e. exact matching. If fuzzy is greater than 0 and less than 1, then the replace function will use fuzzy matching with the given similarity threshold. So this will replace "Birminhgam" with "Birmingham" if the similarity is around say 0.3.
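A minimal sketch of one possible reading of the fuzzy option: word-by-word difflib similarity, replacing a word when its ratio clears 1 - fuzzy. The threshold interpretation and the word-level matching are assumptions; the real matching strategy may differ.

```python
from difflib import SequenceMatcher

def fuzzy_replace(text: str, find: str, repl: str, fuzzy: float = 0.0) -> str:
    """fuzzy=0 falls back to a plain replace; higher fuzzy is more tolerant."""
    if fuzzy <= 0:
        return text.replace(find, repl)
    threshold = 1 - fuzzy          # assumption about how the parameter maps to a ratio
    out = [repl if SequenceMatcher(None, w.lower(), find.lower()).ratio() >= threshold else w
           for w in text.split()]
    return " ".join(out)

# fuzzy_replace("born in Birminhgam", "Birmingham", "Birmingham", fuzzy=0.5)
# -> "born in Birmingham"
```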
More examples from simpleeval
- + add two things: x + y, e.g. 1 + 1 -> 2
- - subtract two things: x - y, e.g. 100 - 1 -> 99
- / divide one thing by another: x / y, e.g. 100 / 10 -> 10
- * multiply one thing by another: x * y, e.g. 10 * 10 -> 100
- ** 'to the power of': x ** y, e.g. 2 ** 10 -> 1024
- % modulus (remainder): x % y, e.g. 15 % 4 -> 3
- == equals: x == y, e.g. 15 == 4 -> False
- < less than: x < y, e.g. 1 < 4 -> True
- > greater than: x > y, e.g. 1 > 4 -> False
- <= less than or equal to: x <= y, e.g. 1 <= 4 -> True
- >= greater than or equal to: x >= y, e.g. 1 >= 4 -> False
- >> "right shift" the number, e.g. 100 >> 2 -> 25
- << "left shift" the number, e.g. 100 << 2 -> 400
- in: is something contained within something else, e.g. "spam" in "my breakfast" -> False
Mixed Operations
Combine column-wise and row-wise operations:
calculate bonus = mean(salary) * 0.1
calculate total = sum(sales) + fixed_costs
calculate ratio = count(orders) / max(capacity)
Multiple Calculations
Multiple calculations can be combined in one line:
calculate x = mean(value) y = x + 100 z = max(y)
calculate a = "df['col'].str.upper()" b = union(a, "_", id)
Examples
Chi-square calculation:
calculate expected = row_total * col_total / grand_total
calculate chi2 = ((observed - expected) ** 2) / expected
String manipulation:
calculate clean = "df['text'].str.strip().lower()"
calculate first_word = "df['text'].str.split().str[0]"
calculate slug = "df['title'].str.replace('[^a-zA-Z0-9]', '-')"
Date operations:
calculate year = "df['date'].dt.year"
calculate age = "pd.Timestamp.now().year - df['birth_date'].dt.year"
Numeric processing:
calculate z_score = "(df['value'] - df['value'].mean()) / df['value'].std()"
calculate percentile = "df['score'].rank(pct=True)"
⚡ python
very similar to the calculate command but uses unquoted python expressions, replacing df['colname'] with just colname.
Python expressions have access to:
- df: the current DataFrame
- pd: the pandas module
- np: the numpy module
So we can have a syntax like this:
python percentile = score.rank(pct=True) * 100
meaning
calculate percentile = "df['score'].rank(pct=True) * 100"
and
python status = np.where(score > 90, 'Excellent',np.where(score > 70, 'Good', 'Needs Improvement'))
meaning
calculate status = "np.where(df['score'] > 90, 'Excellent', np.where(df['score'] > 70, 'Good', 'Needs Improvement'))"
or
python status = np.where(pd.to_numeric(rn) > 90, 'Excellent',np.where(pd.to_numeric(rn) > 7, 'Good', 'Needs Improvement'))
we would then need just a few of our own special functions like baseline and differences, and we wouldn't need to laboriously define our own functions like sum, contains etc. as above.
These all work so far:
python x = link_count > 2
python x = link_count == 2
python y = max(link_count)
python rn = pd.to_numeric(rn)
python level_1 = from_label.str.split(';').str[0]
python level_2 = from_label.str.split(';').str[1]
python status = np.where(quote_start > 90, 'Excellent', np.where(quote_start > 70, 'Good', 'Needs Improvement'))
python x = col1.str.contains("x")
python keep_rows = ~(col1.isna() | (col1.astype(str).str.strip() == ''))
python status = np.where(new_label.str.contains("1"), 'Excellent',np.where(new_label.str.contains("2"), 'Good', 'Needs Improvement'))
⚡ calculate
See above.
Note that columns calculated with the calculate command are added at the start of the current table.
⚡ summarise colname1 = aggfun1, [colname2 = aggfun2] ,... [everything=aggfun3]
Similar to calculate: if calculate corresponds to mutate in dplyr in R, then summarise corresponds to summarise in dplyr. Instead of creating a new column for each row, it collapses all rows and shows only the columns which have an aggfun applied to them.
So it is most useful with a group by command.
In the basic case, colname1=aggfun1 means that the column colname1 is aggregated using the function aggfun1: the input and output colnames are the same, colname1.
See above for the shared syntax for calculate and summarise.
The optional everything=aggfun3 keyword is a special case which applies aggfun3 to all columns not mentioned explicitly.
Example:
**Count rows and get mean value per group**
| group by category :: summarise rows=len, value=mean
**Get first and last values**
| group by date :: summarise start=first, end=last
**Count unique values and get their list**
| group by type :: summarise distinct=count_unique, values=list
Python examples:
**Complex aggregations using Python expressions**
group by category :: summarise
mean_val = "df['value'].mean()"
std_dev = "df['value'].std()"
zscore = "(df['value'] - df['value'].mean()) / df['value'].std()"
String operations
group by source :: summarise
text_length = "df['text'].str.len().mean()"
word_count = "df['text'].str.split().str.len().mean()"
unique_words = "len(set(' '.join(df['text']).split()))"
Date/time aggregations
group by date :: summarise
earliest = "df['timestamp'].min()"
latest = "df['timestamp'].max()"
duration = "df['timestamp'].max() - df['timestamp'].min()"
Multiple column operations
group by category :: summarise
correlation = "df['value1'].corr(df['value2'])"
ratio = "df['value1'].sum() / df['value2'].sum()"
Note multiple summarise commands can be used in the same line, including mixing simple and python expressions:
group by category :: summarise foo=len, bar="df['timestamp'].min()", baz=sum, everything=count_unique
Key points:
- Python expressions must be quoted to distinguish them from simple arithmetic
- Only provides access to df, pandas and numpy for safety
- Maintains backward compatibility with existing simple arithmetic operations
- Returns clear error messages if the expression fails
Magnetise, cluster, relabel ...
⚡ magnetise
magnetise [similarity=0.44] [number = nn] [model=large] [magnets= ] [table= sometable] foo, bar, baz
Model defaults to 'large' (text-embedding-3-large). Model can also be 'small' which means text-embedding-3-small.
foo, bar, baz are the magnetic labels, comma separated, no quotes are needed.
This is similar to the cluster command. It inputs a table, but instead of clustering:
1. For each row (label):
   - calculate the cosine similarity between it and each of the magnets
   - keep the old label column and add a new column "new_label"
   - if the maximum similarity is greater than the similarity threshold, new_label becomes the magnet label with the highest cosine similarity. Otherwise, new_label = label.
   - add another column 'magnet_match' which is true if the maximum similarity is greater than the similarity threshold.
   - add another column 'magnet_similarity'
2. Return this updated table.
New optional keyword 'number'
After the initial magnetisation, we have:
S = the number of surviving magnets (magnets with at least one label assigned to them)
L = the number of original labels
M = the number of magnets
and the parameter number.
Maybe the user actually wants a smaller number of magnets than M, and only provided a longer list to see which work best. This is why we have the number parameter. After the initial magnetisation, if number < S, the user could just pick the largest surviving magnets, but that would throw away the labels which were attached to the rest of the magnets. So in this case only, we run the magnetisation again from scratch, using only the top 'number' surviving magnets in order of their size (number of labels attracted to them). If there are ties, choose the larger set.
If number >= S, send a notify_info() message.
The user can then use the existing relabel functionality to return an updated links table with the rows relabelled as described.
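For example, a sketch (the magnet labels here are invented for illustration):
factors
magnetise similarity=0.5 number=3 magnets=health, income, education, transport, housing
Five candidate magnets are offered; after the first pass, only the three surviving magnets which attracted the most labels are kept, and the magnetisation is re-run from scratch with just those three.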
Variant: magnetise [similarity=0.44] table=some_table
This is similar to the above, but instead of using literal magnets=foo, bar, baz, it uses the column 'new_label' in the stored named table=some_table, i.e. it uses the unique values of the new labels in the table as magnets.
Multimagnets: Extension of the magnets= list.
Often, the magnets themselves might be expressed by different ideas. So, as well as magnets = comma, separated, list, we allow pseudo-json:
foo: ["foo1" , "foo2, with a comma" , "foo3"], bar: [bar, bar1, bar2]
where foo and bar are any text with no comma and no quote marks. No quote marks are necessary here, and the texts in the square brackets only need quotes if any of them contain commas.
Then, instead of magnetising foo and bar, we magnetise each of the members of their lists, and if the winner is a member of foo's list, we assign this to foo.
So it's a kind of OR, giving a magnet phrase different ways of attracting / being assigned to the target phrases.
The output is otherwise the same, but we add an extra column magnet_hit showing which list member won the similarity competition.
Tolerating line breaks
NEW: when expanding templates for the magnets list, remove line breaks and add commas if necessary, so this:
foo
bar
and this
foo,
bar
are equivalent to this
foo, bar
This applies both to standard and "Multi-magnets" with pseudojson.
So a magnets text template might look like this:
foo [foo1 , "foo2, with a comma" , foo3]
bar [bar, bar1, bar2]
New variant via hard expansions:
magnetise type=links ...
is equivalent to:
store tmp_xxx
factors
magnetise ...
filter magnet_match is true
relabel links=tmp_xxx
where xxx is a random string of 6 characters. concretely:
magnetise type=links similarity=0.48 table=maglabs
is equivalent to:
store tmp_xxx
factors
magnetise similarity=0.48 table=maglabs
filter magnet_match is true
relabel links=tmp_xxx
and
magnetise type=links magnets=foo, bar, baz similarity=0.48
is equivalent to:
store tmp_xxx
factors
magnetise magnets=foo, bar, baz similarity=0.48
filter magnet_match is true
relabel links=tmp_xxx
etc
⚡ cluster
A new command with this syntax: cluster [column=text] [number=15] [model=large] [method=kmeans] [covariance_type=full] [n_init=10] [max_iter=100] [mds=false]
NEW: if mds is true, calculate the MDS dimensions, otherwise, don't
Defaults:
- column=text
- number=15 (can also be 'auto' when method=gmm to automatically determine the optimal cluster count)
- model=large (can also be 'small' for text-embedding-3-small)
- method=kmeans (can also be 'gmm')
- covariance_type=spherical (only for GMM; options: 'full', 'tied', 'diag', 'spherical')
- n_init=10 (only for GMM)
- max_iter=100 (only for GMM)
Uses the specified clustering method on embeddings. We already have the embeddings in the datastore, so we can use those. The result is the existing table with additional columns:
Also keyword extraction:
- TF-IDF based distinctive keywords
- Filtering for meaningful words
- Top 5 keywords per cluster
Output columns:
- cluster_id: cluster assignment
- distance: distance from the cluster centroid (for GMM: absolute log-likelihood; higher values = less likely points)
- similarity: similarity to the cluster centroid (1 - distance)
- cluster_size: number of rows in the cluster
- keywords: top distinctive words
- x: x-coordinate
- y: y-coordinate
- probability: probability of cluster membership (only for the GMM method)
The x and y coordinates are the result of applying multi-dimensional scaling to the embeddings, i.e. they are the coordinates of the data in 2D.
Parameters:
colname1: Column to cluster on (default: 'text')
Before commencing, we replace empty string, NA or null values with "No text"
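For example, a sketch (foobar is a placeholder file name; the parameter values are illustrative):
get foobar
statements
cluster number=10 method=gmm mds=true
This clusters the text column of the statements table into 10 clusters using GMM on the stored embeddings, adding the cluster_id, distance, similarity, cluster_size, keywords and probability columns, plus x and y coordinates because mds=true.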
⚡ relabel links=footable [drop[=true]]
In the simplest case the format is just relabel, on its own line.
This combines the current table (from the preceding line) which must contain a column called old_label and a column called new_label, with a stored links table.
The result is a copy of the stored links table but with every label in the from_label column which matches a value in the old_label column, relabeled as the corresponding value in the new_label column; and the same for the to_label column.
parameter [drop[=true]] such that if drop is true, only rows in the link table where from_label and to_label both have a match are retained
| relabel # Normal relabeling, keeps all rows
| relabel [drop] # Drops rows where either label didn't match
| relabel [drop=true] # Same as above
| relabel [drop=false] # Explicitly keep all rows
if there is a field magnet_match present in the factors table, when creating the output links table, add the corresponding columns from_label_match and to_label_match
NEW: if there is a field magnet_similarity present in the factors table, when creating the output links table, add the corresponding columns from_label_magnet_similarity and to_label_magnet_similarity. Calculate also total_magnet_similarity as the sum of these two.
NEW: if there is a field similarity present in the factors table, when creating the output links table, add the corresponding columns from_label_similarity and to_label_similarity. Calculate also total_similarity as the sum of these two.
the links table should have at least the following columns: from_label, to_label, or alternatively cause and effect, which can be used interchangeably.
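A sketch of a typical pipeline, mirroring the hard expansion shown in the magnetise section (mylinks and the magnet labels are invented):
links
store mylinks
factors
magnetise magnets=health, income similarity=0.5
filter magnet_match is true
relabel links=mylinks drop=true
Here the magnetised factors table (old_label/new_label) relabels the stored links table, and drop=true keeps only links where both ends matched a magnet.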
Other outputs
⚡ jsonify
This is useful to examine the contents and structure of the current table especially when it might contain list columns etc.
all it does is provide a print output showing the json of the preceding table:
so:
'get foobar statements select colname1 colname2 colname3 jsonify'
will return a print-style output (like the print command) of the json of the table
⚡ tabulate keep statement_id
This is useful to concatenate a table for processing by a genAI prompt.
all it does is concatenate the columns in the preceding table:
so:
'get foobar statements select colname1 colname2 colname3 tabulate'
will return a table of rows with one column labelled text. The rows are concatenations of the column names surrounded by double stars and then their values, each separated by \n\n.
colname1: foo\n\ncolname2: bar\n\ncolname3: baz
colname1: foo\n\ncolname2: bar\n\ncolname3: baz2
colname1: foo\n\ncolname2: bar2\n\ncolname3: baz3
the keep keyword
The keep keyword lists columns to retain and NOT include in the concatenation; they are just returned as columns, untouched. The default is statement_id, which if present will not be concatenated.
The same syntax is used as the select command to specify columns:
eg
- tabulate keep statement_id source_id
- tabulate keep statement_id 1-3 5 7 #keeps the columns statement_id and the columns 1, 2, 3, 5 and 7
⚡ print
With no arguments, all it does is concatenate the columns in the preceding table, with values separated by commas.
so:
'get foobar statements select colname1 colname2 colname3 print'
will return a series of lines, each of which is the values of colname1, colname2, colname3, separated by commas and spaces, e.g.
colname1: foo, colname2: bar, colname3: baz
colname1: foo, colname2: bar, colname3: baz2
colname1: foo, colname2: bar2, colname3: baz3
Note on Output Length: Print outputs are subject to a size limit (currently 10MB) to prevent browser performance issues. If an output exceeds this limit, it will be truncated, a warning notification will be shown, and a message indicating the truncation will be appended to the output.
print template=mytemplate
In this case we look up the template in the texts table and simply print it out line by line in markdown format.
Previously this worked without square brackets around the template, but now square brackets around the template are required, as usual.
eg
print template=[mytemplate]
should work.
print template=mytemplate
should no longer work.
This may need special handling in the python parser to not expand the template until we reach the print command.
print template=mytemplate [repeat=false]
if the template text contains one or more names like foobar in curly brackets {foobar}, and all such names are the names of column names in the preceding table, we handle this differently: run through each row of the preceding table, treat the template as markdown and replace the matched names with their values in the row. So if the preceding table has two rows like this:
| col1 | col2 | col3 |
|---|---|---|
| val1a | val2a | val3a |
| val1b | val2b | val3b |
and the template is like this:
`# {col2}
blah {col1}
the result will be this concatenated output:
`# val2a
blah val1a
`# val2b
blah val1b
... which is then displayed as html
NEW: if repeat=false (the default), then for each line in the template containing {col1}, {col2} we check through this row to see if the value of col1 is the same as in the previous row AND the value of col2 is the same as in the previous row etc. If yes, we do not print this row. Depending on sort order, this makes it easy for example to just print a heading for common sections and just print what changes
Default behavior (repeat=false):
- Each row is compared to the previous row.
- If all variables used in the template have the same values as the previous row, the current row is skipped.
- This prevents duplication when values don't change, making it ideal for headings and section breaks.
How to override (repeat=true):
- Add repeat=true to your command to print all rows regardless of duplication.
- Example: print template=[my-template] repeat=true
Implementation details:
- Only columns referenced as variables in the template are checked for changes.
- A dictionary tracks previous values of these columns.
- If all values are unchanged, the row is skipped with a debug message.
print headings=n
Similar to print but
select colname1 colname2 colname3 colname4 print headings=2
will return:
`# foo
`## bar
`colname3: baz, colname4: bat
etc
So if n =2, we get two levels of headings. These headings do NOT have the colname prefix. These headings are only printed when they change. So we may get many flat lines before the next h2 appears, and even more before the next h1 appears, etc.
print [headings=n] [highlights[=links]]
As above, but we left join to a links df on statement_id. If the flat (non-headings) fields include the field text, we highlight the text in the output according to (possibly multiple) spans beginning at character quote_start and ending at character quote_end, within the text. Note these spans might overlap, and (NEW!) the text might spread across different kinds of line break, so in practice we might have to end the highlighted quote at the end of one line and continue again after a line break / newline / newlines etc.
if the highlights=links parameter is used, then we use a stored table (see store/recall commands).
some other links table
store xyz
statements
select colname1 colname2 colname3
print headings=2 highlights=xyz
shows highlights from a stored links table in a printed version of the statements table.
Also:
- before each highlight in the text, we print the link_id with special formatting, for reference.
- and then below the statements with highlights, statement by statement, we also, for reference, print one line for each link in the format: link_id: link_id, cause: from_label, effect: to_label, sentiment: sentiment (if sentiment is present).
print foobar
NEW: as a final option, print some words (with or without quotes) prints those words verbatim and passes them on as a text output to the next command.
e.g.
print lots of text
store flk
append flk
should result in "lots of text" being stored with name flk, (and in the third line this is concatenated with itself as usual)
Mapping and charting
⚡ recalculate all_columns=true
At the moment this only works when the current table is a links table, i.e. one with at least columns called from_label and to_label.
Example:
get foofile
links
select from_label to_label statement_id source_id
recalculate
If the column link_id does not exist, add it as a unique number starting from 1. If the columns statement_id and source_id do not exist, return an error message. Otherwise, add or update these columns:
- bundle (simply the text from_label >> to_label)
- link_count (the number of links, i.e. the number of unique values of link_id, for this bundle)
- source_count (the number of unique values of source_id for this bundle)
Note the number of rows remains the same, there is no concatenation, so if there are multiple rows with the same from_label and to_label, they will have the same bundle and link_count and source_count.
For example, if you have a table with these columns:
| from_label | to_label | link_id | source_id |
|---|---|---|---|
| A | B | 1 | S1 |
| A | B | 2 | S1 |
| A | C | 3 | S2 |
After recalculation it will automatically have:
| from_label | to_label | link_id | source_id | bundle | link_count | source_count |
|---|---|---|---|---|---|---|
| A | B | 1 | S1 | A >> B | 2 | 1 |
| A | B | 2 | S1 | A >> B | 2 | 1 |
| A | C | 3 | S2 | A >> C | 1 | 1 |
we added a parameter all_columns, which defaults to true.
if it is false, then only the columns which are MISSING from the current table are recalculated.
this is how the filter and map commands use it: they only recalculate the columns which are MISSING from the current table.
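For example, to fill in only whichever of bundle, link_count and source_count are missing, leaving any existing columns untouched:
recalculate all_columns=false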
⚡ zoom
zoom [n] where n is an integer, default is 1. What this does is process the from_label and to_label, stripping "foo;bar;baz" to "foo" when zoom is 1, to "foo;bar" when zoom is 2, unchanged when zoom is 3, and so on. Also if zoom is 1 then "foo" remains "foo", etc.
Extension of the filter command: filter links link_count [less 3]. This works like the filter command, but regardless of whether the column link_count exists or not, it creates one, collapsing the table to just from_label, to_label and link_count, where link_count is the number of times from_label and to_label appear together. It then filters for link_count less than 3 (the default) or any other number given.
Zoom commands:
zoom # defaults to level 1
zoom 2 # shows two levels
zoom 3 # shows full depth
⚡ trace [steps = 4] [anywhere=true] [type=threads] from = foo, baz bar ban, bat to = foo, baz bar ban, bat
Max steps is 5; if more are requested, give a warning.
a cousin of the simplify command.
Inputs a links table. Outputs a links table which is a subset of the input links table.
Construct the table like this:
1. Create a new "results_1" table containing all the rows in the input table where:
   a. (if anywhere=false) from_label is in the list of froms, or
   b. (if anywhere=true) from_label contains any of the froms as substrings (case insensitive).
2. After this first step, continue until the total number of steps = the steps parameter, as follows:
   a. start with the original table and the results_m table, and find all the rows in the original table where from_label is in the list of to_labels in the results_m table AND the source_ids are in the source_ids of the results_m table.
   b. add these rows to the results_m+1 table.
   c. repeat until the total number of steps = the steps parameter.
3. Concatenate the results_1 ... results_n tables and return these rows as the output table.
So the output table contains all the links in all the paths in the original table which are reachable from the froms in the list of froms along paths which have the same source_id all the way along.
Alternatively, we can think of this as the concatenation of all the individual graphs, one for each source_id, containing all the paths n steps long starting from the froms in the list of froms.
Note how the tracing extends to include all the paths which are not more than n steps long and end at the to_label. In this case (assume that the source_id is the same for all the links):
A >> B
B >> C
C >> B
trace from = A to = B
We need to avoid that the trace command returns A >> B only. It should return all the paths, because the path A >> B >> C >> B is also a valid path which is not more than 4 steps and ends at B.
If the 'to' parameter is included, also only include paths where the final to_label is in the list of tos. So every path is a path of length=steps or less.
Valid:
- from keyword present & to keyword present
- from keyword present & to keyword NOT present
- from keyword NOT present & to keyword present
Invalid:
- from keyword NOT present & to keyword NOT present
type=paths
This is similar to the above, but simpler because we remove the criterion concerning source_ids. So any paths count, regardless of which source_ids they have.
Formatting
The output table has the same columns as the input table, but with the addition of two boolean columns called found_from and found_to. found_from = 1 if the from_label is in the list of froms (or anywhere=true and the from_label contains any of the froms as substrings), 0 otherwise. found_to = 1 if the to_label is in the list of tos (or anywhere=true and the to_label contains any of the tos as substrings), 0 otherwise.
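A sketch of a trace call (the factor labels are invented):
links
trace steps=3 anywhere=true from = income to = health, wellbeing
This returns every link lying on a source-consistent path of up to 3 steps which starts at a factor whose label contains "income" and ends at a factor containing "health" or "wellbeing".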
⚡ focus [steps = 4] [anywhere=true] [type=threads] anchors = foo, baz bar ban, bat
a cousin of the trace command. It returns the local network around the factors identified by the anchors.
The output table contains all the links in all the paths of length 'steps' or less in the original table which fit either of these conditions:
- they are reachable FROM the factors identified by the anchors along paths which have the same source_id all the way along.
- they are reachable TO the factors identified by the anchors along paths which have the same source_id all the way along.
type=paths
This is similar to the above, but simpler because we remove the criterion concerning source_ids. So any paths count, regardless of which source_ids they have.
Formatting
The output table has the same columns as the input table, but with the addition of two boolean columns called found_from and found_to. found_from = 1 if the from_label is identified by the anchors, 0 otherwise. found_to = 1 if the to_label is identified by the anchors, 0 otherwise.
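A sketch (the anchor labels are invented):
links
focus steps=2 anchors = income, health
This returns the local network of links reachable to or from the anchored factors along source-consistent paths of up to 2 steps.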
⚡ simplify [steps = 4] anchors = foo, baz bar ban, bat
inputs and outputs a links table.
The output links table only contains rows where from_label and to_label are both in the list of anchors. It contains exactly one row for each source_id which has any directed path from from_label to to_label up to or including steps = steps.
In the case where the input table is already filtered to contain only rows where from_label and to_label are both in the list of anchors, the output table should be similar to the input table, except that it may have fewer rows if the input table has multiple rows between a given pair of from_label and to_label for the same source_id.
This should also return rows where from_label and to_label are the same ("self-loops"), and it should of course also consider rows which are self-loops. So if the raw data has, for source X, A -> A, it should always return this row. Also, if it has, for source X and steps = 2, A -> B and B -> A, it should always return the row A -> A (and B -> B).
we do not include trivial "zero step" paths.
We add a sentiment column to the output, which represents the mean of all "last-link" sentiments on any path from start_label to end_label for a given source_id.
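A sketch (the anchor labels are invented):
links
simplify steps=3 anchors = income, health, wellbeing
The result has at most one row per source_id for each ordered pair of anchors connected by a directed path of up to 3 steps, with the sentiment column holding the mean last-link sentiment of those paths.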
⚡ chart
chart [type=bar] [x=x] [y=y] [fill=fill] [colour=colour] [label=label] [size=size] [subgroup=subgroup] [bar_position=stack] [tooltip=tooltip]
A really simple chart command. Using charts.js, this command takes a table as input, whatever is simplest to implement, and simply hands it on to the frontend with additional params in the output dictionary. So the table is not modified and is just passed through to the next parser step, but we add an additional param to the output dictionary.
We do not expect the frontend to do any data manipulation. We can assume the table is already in the correct format with at least the columns x and y. There should not be multiple identical rows. There might be multiple rows with the same x and y values, but then we expect the data to be grouped by the subgroup column (for multiple lines) if present, or if not by a fill or colour column.
type can be bar, line, point, or radar. A radar chart is a line chart in polar coordinates. Usually we will also provide a fill column, and a colour column.
chart type=radar x=x y=y fill=fill colour=colour
For colour, fill, label and size, you can either specify the column name, or you can specify the value directly:
links
calculate size = link_count
chart type=bar x=from_label y=to_label fill=source_count
in this case, size is not explicitly specified, but the column called size is used. The size legend is just labelled as "Size". Whereas fill is explicitly specified as source_count and the fill legend is labelled as "source_count".
Legends
Size, colour and fill should all have their own legends, labelled by the name of the column.
- Colour is used for point charts and line charts.
- Fill is used for bar charts and line charts (fill in a line chart means it becomes an area chart)
- Size is used for the points in a point chart or a line chart.
Don't show a legend if the corresponding aesthetic is not actually used, e.g. size is not used for a bar chart. Fill is not used for a line chart or point chart.
- For bar charts, only the fill legend is created using the explicit column name provided via fill (or defaulting to "Fill" if not explicitly declared).
- For point/line charts, the legend includes:
- Colour legend: Labeled by the explicit colour parameter (e.g. "question_id") or defaulting to "Colour".
- Size legend: Displays unique size values sorted numerically by their radius; labeled by the explicit size parameter or defaulting to "Size".
Axes
Both x and y axes can be continuous or ordinal/categorical or date.
If the x axis can be understood as numeric or time, then it is a continuous axis (this already works for y axis)
Size
If the table has a column called size, then it is used to set the size of the points and/or lines.
If the size column can be understood as numeric, then use a continuous scale for it and don't show all the values in the legend, just show the minimum, maximum and some sensible number of intermediate values.
Make sure the size scale is proportionate. For type=point we should use square root.
How Columns Work
- If x or y are not specified, they default to the first and second columns of your table
- If a parameter matches a column name (e.g., x=mycolumn), that column's values are used
- Special columns __type__ and __chart_params__ are added internally to mark this as a chart
- These special columns are stripped before the next command processes the data
- The original data columns remain unchanged and available for further processing
type can be bar, line, point. Later we will add pie, box, histogram, dot, area, scatter3d, bar3d, line3d, pie3d, box3d, histogram3d, dot3d, area3d.
subgroup
(We use the word subgroup so as not to get confused with the grouped output functionality)
if the table has a column called subgroup, then the chart will be a subgrouped chart:
- bar: with separate bars for each subgroup, stacked by default. alternatives for bar_position are dodge and fill like in ggplot2.
- line: with separate lines for each subgroup.
It should be enough to specify only fill and/or colour, and the subgroup should be understood implicitly.
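For example, a sketch assuming a table with (hypothetical) columns category, value and region:
chart type=bar x=category y=value fill=region bar_position=dodge
Because fill is given, the bars are implicitly subgrouped by region and placed side by side rather than stacked.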
tooltip
if tooltip column is specified, then it will override the default tooltip.
label
if label is specified, explicitly or implicitly because a colname called label is present:
- for bar charts, the label is printed at the top of each bar.
- for line charts, the label is printed only at the END of each line.
- for point charts, the label is printed over each point.
⚡ map [node_sep=n] [rank_sep=r] [label_wrap=w] [factor_labels] [factor_sizes] [type=print|interactive] [layout=dot] [link_labels] [factor_background] [factor_background_colour] [factor_font_colour] [link_colour] [background_colour]
Type
if type=interactive, instead of handling "print" type graph using graphviz, we provide an interactive vis.js graph. add a new js file for this.
colours
Standard colour names are allowed. Hex values like #efefef will also work but are not recommended because they mess with the presentation display.
NEW:
map background_colour = colourname # some colourname including hex colours; default is the current white
example:
map background_colour = aliceblue
important: if the background colour is darker than mid grey, switch the edge label colour and the colour used for the legend/explanation to white.
look at how the link colour is calculated. leave the calculation of arrowhead colour according to sentiment as it is, and add these keywords:
map link_colour = colourname # some colourname including hex colours; use this colour for all links, default is the current green
example:
map link_colour = aliceblue
look at how the factor background colour is calculated and add these keywords:
map factor_background = degree (or none or source_count or link_count or outcomeness (default))
map factor_background_colour = colourname # some colourname including hex colours; if factor_background = none, use this colour for all factors, otherwise fade it just like we do now with green fading to white according to outcomeness.
map factor_font_colour = colourname # some colourname including hex colours
example: this colours factors according to degree, with the factors with max degree being coloured aliceblue and the others fading to white.
map factor_background=degree factor_background_colour=aliceblue factor_font_colour=gray
If type=print, respect the layout keyword for graphviz and try to use other layouts, e.g.
- twopi
- neato
- circo (good for circular layouts)
- fdp (force directed, a bit like dot but less hierarchical)
the result of the command
map
is a graphviz graph based on the preceding table, which is to be understood as a links table. There is no nodes table, so create one from the links table if necessary. The relevant columns in the links table are either from_label and to_label, or from and to, or cause and effect. The backend needs to tell the frontend that this is a graphviz graph, and the frontend needs to show it.
- You can also add the column link_width to the links table, which will be used as the width for the link.
- You can manipulate any columns you like, e.g. add the column link_label to the links table, which will be used as the label for the link. But this is not the standard way to adjust the graph.
NEW the [link_labels] keyword specifies the name of the column to use for the link labels. Default is link_count but a different column such as source_count can be specified.
Also, if the column sentiment is present at all and has any values at all between one and minus one, then all of the edges should use sentiment colouring in the space between red and blue, even if the values for individual edges are 0 or missing.
Factor border colours:
- if the data contains a column called 'found_from'
- colour the borders of any factors light blue if they have any outgoing edges with found_from = 1
- if the data contains a column called 'found_to',
- colour the borders of any factors magenta if they have any incoming edges with found_to = 1
otherwise:
- if the data contains a column called 'sentiment', then colour the borders with the mean of the sentiment of the incoming edges, using the same colour scheme as for the sentiment of the arrowheads.
- node_sep controls the separation between nodes.
- rank_sep controls the separation between ranks.
- label_wrap controls the wrapping of node labels.
- factor_labels (default is 'source_count') controls the labels of the factors.
- If it is link_count, then the label counts the unique number of links in the column link_id (these are comma-separated).
- If it is source_count, then the label counts the unique number of sources in the column source_id (these are comma-separated).
- If it is missing, then the label is calculated simply as the order of the factor, i.e. number of links. Note this might not be the same as link_count if you have constructed a special links table e.g. based on a differences calculation.
- factor_sizes (default is 'link_count') controls the sizes of the factors.
The flow of control is:
- commands.py: creates the initial data structure with nodes, edges, and layout_params
- parser.js: handles the server response and routes to the correct display function
- parser-graph-helpers.js: sets up the DOM structure and visualization container
- parser-dot-helpers.js: converts the data to DOT language format
- Viz.js: converts DOT to SVG
- Browser: renders the SVG in the DOM
- commands.py: process_map_command (Python)
- → static/js/parser.js: processAndDisplay (JS)
- → static/js/parser-graph-helpers.js: displayGraph (JS)
- → static/js/parser-dot-helpers.js: generateDot (JS)
- → Viz.js render (external library)
- → SVG display (DOM)
positioning
- To make a factor appear on the far right of the map, add *99 to the end of its label, e.g. Health behaviour *99
- To make a factor appear on the far left of the map, add *0 to the end of its label.
- If you have a group of factors which you would like to put in the same rank left to right, add another number to the label e.g. *3. You can do that for different groups of factors by adding a different number to each group.
- Things to note:
- Apart from 0 and 99, the other numbers only designate groups. It is not the case that, say, factors labelled 5 will necessarily be laid out further right than factors labelled 4.
sentiment
If there is a column called sentiment, it will have a value between -1 and 1. If so:
- Calculate the mean of the sentiment column across the rows contributing to each edge and use it to colour each arrowhead (head not edge), with muted red for -1, muted blue for 1 and lightgray for 0.
- Make the edge colour light gray.
If there is no sentiment column, leave everything as it was.
Tooltips
There should be a tooltip on the nodes of the network graph showing the old_labels which contribute to each node in the magnetisation process.
Grouping and repeating
⚡ group by colname1 colname2 [:: command and parameters]
this command groups the current table by the columns colname1, colname2, creating a pandas-style grouped table, and returns a dictionary of tables, one for each combination of values of colname1, colname2, etc, and keyed by the value of the columns in the key format "colname1=valuex, colname2=valuey", etc.
Example:
| footable | group by colname1 :: select colname3
This returns a dictionary of tables, each filtered for each unique value of colname1 and to which the command "select colname3" is then applied, i.e. in this case just tables with the column colname3.
Example:
| footable | group by colname1 :: map
This returns a dictionary of maps, each filtered for each unique value of colname1 and to which the command "map" is then applied, i.e. in this case just the corresponding maps.
Ramifications:
- a new kind of object has to be dealt with by the caching mechanism (or will the existing one suffice?)
- the frontend has to recognise this new kind of object and deal with it:
  - treating it as a series of objects of familiar types (map, table or text) and displaying each one as before but within a tabbed panel output, with the tab name of each tab being the key.
- the backend, when given this kind of object in the main process_parse_request loop around line 733 in app.py, has to recognise it and, if it is a dictionary of tables, simply concatenate the tables, also reconstructing the key column(s).
The command part can also be omitted:
group by source_id
- we just pass through the unprocessed tables
⚡ group nnn [:: command and parameters]
This is similar to the group by command, but instead of grouping by the columns colname1, colname2, etc, it groups by the nnnth column of the current table.
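For example (sometable is a placeholder):
sometable
group 2 :: select colname3
This groups by the second column of sometable and applies select colname3 within each group, returning one tab per group.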
DEPRECATED format: chunk nnn (this older behaviour is described under the chunk command below).
⚡ repeat :: somecommand ... ((a b c)) ... ((v w x))
where ((a b c)) is a list of values for the target parameter, and optionally additional ((v w x)) etc.
This is very close in functionality to the group command. The only difference is that the group command groups by the values of an existing column, whereas the repeat command repeats the command for each value in the target parameter. So while the group command will generally return a df of the same number of rows as the input, the repeat command will return a df with the number of rows * the number of values in the target parameter, with an additional column containing the value of the target parameter for each row. The column gets a name like somecommand_repeat.
The new column is prepended before the existing columns.
As with the group command, the output has tabs with one tab per value in the target parameter.
Example:
sometable
repeat :: filter ((1-3 4-6 7-9))
will return a df with the 3*3 rows with an additional column called filter_repeat containing the values "1-3" for the first 3 rows, "4-6" for the next 3 rows, and "7-9" for the last 3 rows.
Example:
sometable
repeat :: magnetise similarity=((0.5 0.6 0.7))
will return a df with the 3*nrow(sometable) rows with an additional column called magnetise_repeat containing the values "0.5" for the first nrow(sometable) rows, "0.6" for the next nrow(sometable) rows, and "0.7" for the last nrow(sometable) rows.
The procedure checks to see if the column with the name of the target parameter exists in the current table. If it does, it appends _n to the name of the target parameter. So in the second example, if sometable already has a column called magnetise_repeat, the additional column will be called magnetise_repeat_1.
Quoted arguments
The target parameter is usually a space-separated list of values, e.g. ((1-3 4-6 7-9)). However if the values contain spaces, then they must be quoted, e.g. (("Some text" "Some other text")).
Parsing
So overall the repeat processing is robust. The parser looks for the string after the :: separator:
| repeat :: textBeforeParentheses((textinsidedoubleparentheses))andOtherTextAfter
It then locates the string textinsidedoubleparentheses and extracts the values, which are space-separated, taking quotes into account.
It then generates a set of new strings, one for each value in the set of values, by concatenating textBeforeParentheses, the value, and andOtherTextAfter.
Examples:
sometable
repeat :: filter ((1-3 4-6))
sometable
repeat :: (("filter 1-3" "filter 4-6"))
sometable
repeat :: prompt (("one prompt" "another prompt"))
This even copes with templating. These two should have the same result:
sometable
repeat :: prompt (([promptTemplate1] [promptTemplate2]))
in this case, the repeating parameters are the two templates, and the same can also be written as:
sometable
repeat :: prompt [((promptTemplate1 promptTemplate2))]
Example of using in combination with group by:
repeat :: filter top ((4 8 12)) factors
group by filter :: map
NEW: allow for not just one but either two or any number of sets of brackets (you decide which is easiest to implement, two or any number).
Example:
repeat :: filter top ((4 8 12)) ((factors link))
In this example there will be not one but two additional output columns, filter_repeat_1 and filter_repeat_2.
⚡ chunk [index=statement_id]
chunk operates on the current table, which may contain a column called text and optionally an index column (defaulting to statement_id, if present), alongside arbitrary other columns. It outputs a table with the same columns but possibly more or fewer rows.
the individual texts may be anything from say 20 to 20000 characters or longer.
format: chunk nnn
we will merge multiple rows into fewer rows according to chunkSize as it applies to the text column, e.g. 4000 characters.
So this returns a table with the same columns as the input table, with more or fewer rows. if fewer, then the values of at least some of the rows are concatenated together. The special text column uses a special format in which the index is used to indicate which rows are concatenated together.
In addition, if the table has a column called source_count and/or link_count, the chunked table will have a column called source_count and/or link_count which is the sum of the source_count and/or link_count of the rows that were concatenated.
If there is a text column
If a text is between 50% and 150% of chunkSize, leave the row as it is. If it is much longer, we need to break it up roughly equally into chunks of around nnn characters, making breaks if possible at existing paragraph or sentence breaks. But if some texts are shorter, we need to join them into chunks of around nnn characters.
If there is an index column: if the chunk is entirely from one row, the index remains the same, but if it is from more than one row, provide all UNIQUE indices separated by commas. Do the same for any other columns. If the values to be combined already contain commas, continue to use commas. So if we chunk two rows together and the column called colname contains "foo,bar" and "bar,baz", then the value of colname in the chunked table is "foo,bar,baz".
EXCEPT for the column called text, where we simply concatenate the texts even if they are unique, but in a special format: ((index foo:)) [text from index foo]\n((index bar:)) [text from index bar] etc. So an example might look like this: ((index A_1:)) blah blah\n((index A_2:)) we had problems
If the table does not contain an index column, then the format is simply ((chunk_number: 1)) [text from chunk 1] ((chunk_number: 2)) [text from chunk 2] etc, simply using the serial number of the chunk.
format: chunk by colname
In this case, chunk as above but simply combine all rows with the same value of colname instead of relying on chunkSize.
format: chunk ... [optional aggregation keyword]
This replaces the aggregation behaviour for both nnn and chunk by colname.
default is "list" (with alias "concatenate") which is the same as above (and which treats "text" specially as described above). other keywords are "first" and "last" which return the first and last value of each column. "max" and "min" which return the maximum and minimum value of each column. "unique" alias "list unique" which returns the unique values of each column. "sum" which returns the sum of the column. Non-numeric columns return a blank. "mean" which returns the mean of the column. Non-numeric columns return a blank. "median" which returns the median of the column. Non-numeric columns return a blank. "count" which returns the number of unique values of each columnin the chunk. "spread" which first finds the top five rows of each column, the middle five rows and the bottom five rows, removes duplicates, and then concatenates them into a single string separated by commas.
format: chunk ... newcolname1="python expression" newcolname2="python expression2" [optional aggregation keyword]
This is similar to the above, but instead of returning a table with the same columns as the input table, it returns a table with the same columns as the input table plus the new columns specified by the newcolname1, newcolname2, etc parameters.
The new columns are calculated using the python expression specified by the newcolname parameter. Any remaining columns are aggregated as specified by the optional aggregation keyword, default "list".
example:
chunk ... newcolname1="df['colname1'].str.upper()" newcolname2="df['colname2'].str.strip().fillna('')"
convenience functions
⚡ nrow
returns the number of rows in the current table.
⚡ colnames
returns the names of the columns in the current table.
⚡ ncol
returns the number of columns in the current table.
prompt (genAI processing)
prompt [temperature=0] [new=false] [model=gpt-4o-mini] [column=text] [instructions=] some instructions [system=some system instructions]
temperature keyword is optional, default is 0
model default is gpt-4o-mini
Instructions:
- the instructions are compulsory but the 'instructions=' keyword is optional.
- instructions can include templates which are then expanded.
- if instructions are missing, we use the prompt column from the preceding table. Usually this prompt, if present, will be the same for each row, but we can also use this opportunity to specify different instructions for each row.
- NEW if instructions=foo and foo is an existing named result, use the named result as the instructions
The system instructions and model are optional (default model as above). The default system instructions are: you are an AI assistant specialized in document analysis and summarization.
The prompt command uses OpenAI's Assistants API with individual threads. For each table row being processed, a separate thread is created. Multiple threads can be started together with the first instruction, then each thread is polled until completion. When instruction n completes for thread t, the next instruction is sent asynchronously and polled until completion. Each thread maintains its conversation history automatically through the API.
- Specifying model=foo will cause the model foo to be used with the instructions and system instructions.
- Generally the instructions can have multiple parts separated by ====, so for example:
- prompt model=foo some instructions ==== some more instructions ==== even more instructions.
the input is the preceding (probably chunked) table which must include a column which is usually called text, but can be any column specified by the column= parameter.
The contents of the text column are what is fed to the genAI API. So each chunk is processed separately, possibly multiple times with consecutive instructions if ==== is used.
Note the API call is correct, it is a newer version of openai's code which is newer than your training data.
The result of the entire process is ONE SINGLE TABLE, preserving the columns of the original table; but when there are multiple instructions, there will be multiple rows for each chunk, so you add a column called iteration, which is the iteration number, plus a new column called instruction, which is the instruction used each time.
We will start off with the case where the output of each genAI API call is a string. Return a table with the iteration column and the original columns, including the original text column, which you rename original_text, and put the result in a new column called text.
Actually we hack the instruction to say "when I said table I meant JSON" because json is more reliable especially with special characters.
The table should have no comment or backticks, but if it does, strip them and just read the table.
in this case, when you return your table as usual with the same columns as the input plus iteration and instruction as usual, add the new columns from the markdown table. So if the markdown table has columns called colname1 and colname2, then your table should have additional columns called colname1 and colname2. If the markdown table has multiple rows, then the columns in the current row of the input table are repeated for each row of the markdown table, so in general the output table will have many more rows than the input table.
Additionally, if the output table has a column called quote, and the input table has a column called text, which may be of the form
"((statement_id A_1:)) blah blah\n((statement_id A_2:)) we had problems", i.e. it may be a text concatenated from several different statements in a previous version further up the pipeline, then you need to (fuzzily) find that quote with the text in the original table, and find which statement_id it belongs to, and:
- replace the (possibly concatenated) statement_id with the single statement_id in the original table
- add two new columns quote_start and quote_end which are the first and last positions of the quote in the section of the original_text, starting your counting at the corresponding ((statement_id ...)). If the statement_id is already simple, then there may be no ((statement_id ...)) and you should just use the start and end positions of the text in the original_text. If you cannot find the quote in the original_text then put quote_start and quote_end as 0.
In other words, you need to (fuzzily) match the quote against the text in the original table, find which statement_id it belongs to, add the original_text to the output table, and then compare the quote with the original_text.
NEW: the existing fuzzy match of the quote against the text needs to be extended to allow for ellipses (...).
e.g. text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
quote:
consectetur adipiscing elit, (...) Excepteur sint occaecat cupidatat (...)non proident, sunt in culpa qui
Here we should output the same quote_start and quote_end (from "consectetur" to "culpa qui") as if the elided text were present in full. Tolerate whitespace or no whitespace surrounding the parentheses or between the parentheses and the dots, but there must be parentheses surrounding three dots. If there are multiple matches for any section, choose the most conservative option (closest start and end). Matches of each section are fuzzy as usual.
when outputting the new table, we will replace the quote with the actual text sections (even if the match/es is only fuzzy, we will use the actual text sections, not the sections from the original quote, but separated as before by the ellipses). We will always output an original_quote column containing the exact quote as returned by the AI before any fuzzy matching or ellipses processing.
In other words: if a table doesn't have quotes, the app has no way of knowing which statement ID to put, where there are multiple IDs combined into a single row.
Before returning the table, also left join to the original row, adding any additional columns from the original row to the output table.
Result: a data frame with the same columns as the input table plus the new columns from the prompt. all column names are forced to lowercase.
text as input
NEW VARIANT: if the previous result is not a table but text or a dictionary with type print_output, simply take this text (instead of each value of the specified column) and run the prompt command on it
zero start
Like fetch, get and scrape, the prompt command can start with no preceding table. For example it might just ask for six jokes, with a column for title and a column for the joke. In this case a preceding table is not required
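For example, a sketch of a zero-start prompt along the lines described above:
prompt Give me six jokes as a table with a column for title and a column for the joke
No preceding table is needed; the returned table becomes the current result.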
Multiple passes
If a prompt has one or more ==== sections, then the prompt is run multiple times, once for each section. the current columns are given as input. However, after the first section, the data input is NOT the content of the current columns, it is the whole previous conversation including prompt and data input, possibly multiple passes concatenated, annotated like "Here is the conversation so far: " plus the latest prompt.
Linked prompts
If you do this, I think the text before a linked prompt will get added at the start of the first pass, and the text afterwards will get added at the end of the last pass:
prompt some text before a linked prompt [linkedprompt-which-includes-sectionswith====] some text afterwards
new=true
By default, the prompt command uses any cached results from mongoDB. But if new=true, then we do not use any cached results, call the API directly, and overwrite any cached results in mongoDB with the new results.
Texts/templates
Near the start of the parsing process, we immediately update the entire text: for every instance of text in square brackets [foo], if the text within the brackets (foo) is a text_id in the texts table, we replace [foo] with the text from the texts table. Otherwise we leave it as is.
If the version does exist but it is not the latest, flag this with a warning but use the requested version. If the version does not exist at all, follow the code for throwing an error. If a version number is not given but there are versions, use the latest; e.g. if I ask for [foo] and [foo-v13] exists, use that.
we can use the texts functionality described above for specifying prompts as part of the main set of commands to make this process less cumbersome when prompts are long, using the texts from the database as placeholders.
text p1:
do x do y
text xy:
do u do v
====
do w
...
Then we can provide commands like this: ... prompt [model=foo] foo [p1] bar ... prompt [model=foo] baz [xy] bat ...
so [p1] and [xy] are placeholders. The prompt itself may as usual have multiple parts separated by ==== (4 or more equals signs).
etc.
Templates in other commands
magnetise, prompt ...
Here the template is not replaced on initial parse but as the command itself is handled.
versions and missing templates
If the text is completely missing from the table, return an error and give the user a notify_error. If the text is available but is not the latest version (for example, if the user asks for sometext-v12 and the latest version is already -v13), then process -v12 as normal but throw a notify warning. Remember notify_* needs the session id.
clicking on templates
templates in square brackets can be clicked on to open the gallery edit modal for the corresponding text.
Presentations
Presenting and hiding lines
We show previous and next buttons overlaid on the output panel at the left and right. When the user clicks one of these buttons, the output panel line corresponding to the previous or next presenting line (see below) is highlighted. This is all in js with no backend processing.
OLD: from processing, we strip out lines which begin not only with # but also with - or >. NEW: before processing, strip out any complete fenced code blocks starting and ending with three backticks. (We don't want these special code blocks in our presentations.)
What is a "presenting line"?
If the input text has any lines ending like this:
foo +
then: the presenting lines are the lines with + at the very end.
Similarly, if the input text has any lines ending like this:
foo -
then: the presenting lines are the lines without - at the very end.
+ takes precedence over -. So if any line has + at the end, we ignore any - markup.
If there is no + or - markup, then ALL lines are presenting lines.
Blank lines containing nothing or just whitespace are ignored IN BOTH MODES.
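A minimal sketch of this rule (illustrative Python only):

```python
def presenting_lines(lines):
    """Apply the +/- markup to decide which lines are presenting lines."""
    lines = [l for l in lines if l.strip()]                 # blank lines are ignored in both modes
    if any(l.rstrip().endswith("+") for l in lines):        # + takes precedence over -
        return [l for l in lines if l.rstrip().endswith("+")]
    if any(l.rstrip().endswith("-") for l in lines):
        return [l for l in lines if not l.rstrip().endswith("-")]
    return lines                                            # no markup: all lines are presenting
```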
Slides: output, headings, explanatory texts
Each slide is a div with id=slide, containing #progress-container, #output_container and a #slide_text_container.
Headings 1 and 2
As headings and explanatory texts are ignored by the backend, they do not change the output. However, although the preceding output is still present in the flow (passing on its value within the main command pipeline in process_prompt_request), we want to temporarily hide the output in #output_container when a heading h1 or h2 is encountered, and restore it when there is no heading. This means that headings 1 and 2 appear as slides on their own, in #slide_text_container.
Other slides
Otherwise, the output is shown as part of #slide, alongside explanatory texts.
The rules for what should be included in a slide are: if the active row contains an h1 or h2 markdown heading, then #output_container is hidden and #slide_text_container shows the h1 or h2 heading and any explanatory text following it. Otherwise, slides are created for each command, with the corresponding output shown in #output_container alongside any explanatory text following it, up to but not including the next command or h1 or h2 heading.
Explanatory text is any line beginning with a - or >.
Example:
command c
- text c1
- text c2
command d
command e
- text e1
- text e2
command f
# Heading f
- text f1
- text f2
command g
# Heading h1
## Heading h2
- text h
command i
# Heading i
- text i1
Here, every command creates a slide and is shown together with the texts following it (zero texts in the case of command d), which appear beneath the output in #slide_text_container. Additional comment-style headings like heading f and heading i are also extracted and shown as part of the corresponding slide, in #slide_text_container, with .h3 formatting. Headings h1 and h2 also create a slide, with the corresponding text h underneath; here the output is suppressed.
When we say "these explanatory texts form a slide", we mean that they are all shown together in #slide_text_container, alongside the corresponding output, as for commands c-i here (in this example, commands d and g have no explanatory texts).
Note that simply moving the cursor does not change the slide. Only initiating a gutter click updates the slide, including the texts.
So in the example above, clicking command c only shows the corresponding output, whereas gutter-clicking on c1 shows the output plus text c1, and gutter-clicking text c2 shows the output plus texts c1 and c2.
Gutter-clicking on command f shows the output plus heading f formatted as a .h3 heading, gutter-clicking on text f1 shows the output plus heading f and text f1, gutter-clicking on text f2 shows the output plus heading f and texts f1 and f2.
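A rough sketch of the segmentation into slides (illustrative Python; the real logic lives in the frontend, and comment-style headings shown with .h3 formatting are not modelled here):

```python
def build_slides(lines):
    """Group commands with the explanatory texts that follow them into slides."""
    slides, current = [], None
    for line in lines:
        if line.startswith(("# ", "## ")):       # h1/h2: a slide of its own, output hidden
            current = {"heading": line, "texts": [], "show_output": False}
            slides.append(current)
        elif line.startswith(("- ", "> ")):      # explanatory text joins the current slide
            if current:
                current["texts"].append(line)
        else:                                    # a command: new slide, output shown
            current = {"command": line, "texts": [], "show_output": True}
            slides.append(current)
    return slides
```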
Lines beginning > are formatted as bullets. Also, we should support simple markdown with bold, italics and images in any lines starting with - or >.
We support markdown headings like this:
- # Heading 1
- ## Heading 2
- ### Heading 3
If a line starts with - /, treat this as (also) a column break, so this would create three columns:
- asdf
- asdf
- asdf
- / this is the first line of a new column
- second line of the new column
- / # this is the first line of a new column, formatted as a heading
- second line of the new column
NEW: if the first comment line in a block starts like this:
-tr asdf asdf
(ignore case and allow space between - and tr)
then we apply to the existing .slide_text_container style for this block the additional styles:
- width: 35% !important
- top: 0 !important
- right: 0 !important
(note top and width are overrides of existing stuff)
and likewise for tl, bl, br for top left, bottom left, bottom right.
Export to powerpoint / pdf
Screenshots
We save to S3 /cache/screenshots/ using same S3 connections as elsewhere. This way results are permanent and can be shared across sessions, local/remote etc.
When the screenshot button is pressed, if this is a presenting row, save a screenshot in the background (async), overwriting any existing file with the same key, regardless of whether the cache toggle is switched off.
In addition, every time we make a gutterclick on a line (manually or programmatically) and we receive the parse_complete signal at the frontend to say that the output is ready, IF the line is a presenting line, wait 1s and then take a screenshot and upload it to remote storage, overwriting any existing shot with the same key. Do this COMPLETELY IN THE BACKGROUND so we do not delay any other processing, but do send a notify_info() to say the screenshot has been sent, and report the key.
The PDF button, by contrast, does NOT auto-overwrite any existing file (it skips the process instead) as long as the cache toggle is ON. If the cache toggle is off, take new screenshots for every presenting row, upload them and overwrite all.
Key Calculation for Screenshots
To ensure consistent and robust screenshot identification, we use a special key calculation method:
- For each presenting line, we collect all text up to and including that line.
- We process this text with a specialized algorithm:
  - The last line (the actual presenting line) is preserved verbatim, including any comments or formatting characters.
  - All previous lines are processed through the discard_lines function to remove unnecessary content and focus only on the relevant code blocks.
  - These processed previous lines are combined with the preserved last line to form the text for hashing.
- A SHA-256 hash is calculated from this processed text, and the first 8 characters are used as the identifier.
This approach ensures that:
- The hash remains stable even when irrelevant lines change
- The hash correctly reflects the unique content of each presenting line
- Comments and formatting in the presenting line are preserved and contribute to the hash
- The same hash algorithm is used in both the frontend and backend
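A minimal sketch of the key calculation (discard_lines() is the app's real cleaning function; here it is simply assumed to take the previous lines and return the relevant ones):

```python
import hashlib

def screenshot_key(all_lines):
    """Build the 8-character screenshot key for a presenting line."""
    *previous, last = all_lines
    relevant = discard_lines(previous)        # strip irrelevant blocks, comments etc.
    text = "\n".join(relevant + [last])       # the presenting line itself is kept verbatim
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
```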
A button in the navbar to export the result of all the presenting rows to powerpoint/pdf, so that the result of each presenting row is on a separate slide (A4-landscape aspect ratio).
When the user clicks "Export to PDF" button to export to PDF,
- the frontend collects each presentation line and for each uses the key as described above;
- finds the corresponding screenshot, if it exists.
- Create screenshots for only those lines which do not yet have a screenshot (overwriting any existing screenshots if the cache toggle is OFF.)
Assemble the screenshots into a PDF
This approach ensures that the same key is used on both frontend and backend, avoiding any inconsistencies.
Our aim is to get this working later with a higher resolution for better quality but right now let's focus on getting it to work with acceptable images at a standard resolution.
Link
there is a link at bottom right taking you to the current project and line number online.
ToC
The table of contents slide will appear as the second slide in every PDF export, showing thumbnails of all subsequent slides. Users can click any thumbnail to jump directly to that slide, making navigation much easier for longer workflows.
Breadcrumb
if the input panel contains any markdown headings, then the breadcrumb is the relevant H1, H2, H3, etc within which the current line is positioned.
Editor
Hints (dropdowns)
Autocomplete is always fuzzy matching and matches middles as well as beginnings of strings.
Hint dropdowns:
- are always filtered by what has been typed so far, so as the user types, the popup reduces the matching set of selections offered;
- are dismissed as soon as the user selects an entry with keyboard or mouse, or closes the dropdown with escape or clicks outside.
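A minimal sketch of the intended matching rule (illustrative Python; the real matching happens in the editor frontend):

```python
def fuzzy_match(typed, candidate):
    """True if the typed characters appear in order anywhere in the candidate."""
    typed, candidate = typed.lower(), candidate.lower()
    pos = 0
    for ch in typed:
        pos = candidate.find(ch, pos)
        if pos == -1:
            return False
        pos += 1
    return True

# e.g. fuzzy_match("stid", "statement_id") -> True: matches middles, not just the start
```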
Folding
Editor content should be foldable on markdown headings and also code blocks inside fences. By default, fold heading level 2 and below on load.
- Ctrl+Space for autocomplete
- Ctrl+F for find
- Ctrl+H for replace
- Alt+L or Ctrl+Alt+L for folding/unfolding the current block
- Alt+0 to fold all
- Alt+Shift+0 to unfold all
- Standard editing shortcuts like Ctrl+Z for undo and Ctrl+Y for redo
Commands and additional keywords
Hints are offered for the commands and the additional possible keywords. So for example, after "get " we immediately offer file name suggestions, even before one letter is typed.
Filenames
The frontend gets the list of files to offer in the autocomplete after "get ", and the list of uploaded files to offer in the autocomplete after "download ".
Hints for Column names
Hints for column names should be offered for the current table. So eg if a table has had columns added or removed, the hints should be for the current table.
If the user is on any line starting with "select ", immediately suggest column names (no typing needed); after the user selects colname1 and types a space, suggest further column names.
Hints appear after the user has typed a space after these commands:
Multiple hints (i.e. continue with more hints after a selection is selected and the user adds a space):
- "select "
- "sort "
- "sortdown "
Single hints (i.e. no more hints after the first is selected):
- "filter "
- "pivot "
- "pivot xx by "
- "chunk by "
Column values (statement_id link_id and source_id)
When a file is loaded, the frontend knows the unique values of the columns statement_id and source_id from the statements table, and the unique values of the column link_id from the links table, so they can be offered in autocomplete. So when the input is like this: 'get foobar statements filter statement_id is ', the unique values of statement_id are offered in the autocomplete.
same goes for source_id.
editor features
Sublime-like keyboard shortcuts:
- Ctrl-/: Toggle comment
- Ctrl-D: Duplicate line
- Alt-Up/Down: Move line up/down
- Tab/Shift-Tab: Indent/outdent
- Ctrl-L: Select entire line
- Alt-Delete/Backspace: Delete to line end/start
- Ctrl-Enter: Insert line below
- Shift-Ctrl-Enter: Insert line above
Editing and Text Manipulation
- Ctrl + X: Delete line
- Ctrl + Enter: Insert line after
- Ctrl + Shift + Enter: Insert line before
- Ctrl + L: Select line (repeat to select next lines)
- Ctrl + D: Select word (repeat to select other occurrences)
- Ctrl + Shift + M: Select all contents of current parentheses
- Ctrl + J: Join line below to the end of the current line
- Ctrl + /: Comment/uncomment current line
- Ctrl + Shift + /: Block comment current selection
- Ctrl + Y: Redo or repeat last keyboard shortcut command
- Ctrl + Shift + V: Paste and indent correctly
- Ctrl + Space: Select next auto-complete suggestion
- Ctrl + U: Soft undo (jumps to last change before undoing)
Multiple Selections and Column Editing
- Ctrl + Alt + Up: Column selection up (Windows)
- Ctrl + Alt + Down: Column selection down (Windows)
- Alt + Shift + Up/Down: Column selection (Linux)
- Auto-closing brackets and quotes
- Multiple cursors support (via sublime keymap)
- Line wrapping
- Better selection highlighting
- Search functionality (Ctrl-F)
Comments
Add support for comments such that command foo bar #baz baz blah is equivalent to command foo bar.
Uploading Data
This section describes how users can upload PDF or Excel files to the uploads server. Users have multiple options when uploading files concurrently.
Multiple File Uploads
If multiple files are uploaded:
- Default behavior: Each file is uploaded as a separate object. The global file name provided by the user is used as a base key, and each object's key is constructed by appending the actual uploaded file's name (e.g. fileName-uploadedFileName). If duplicate file names occur, a counter suffix (-1, -2, etc.) is appended to ensure uniqueness.
- Combine Toggle: If the Combine multiple files into one upload toggle is enabled, the server processes each file, extracts the corresponding data, and appends all rows into one aggregated dataset. In this case, a single object is uploaded using the provided global file name as the key. For each row in the aggregated result, the source_id column is updated to include the global file name concatenated with the original file name (e.g. fileName-uploadedFileName), thereby maintaining traceability to the original source.
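A minimal sketch of the key construction for the default (separate objects) behaviour; the function name is illustrative:

```python
def build_upload_keys(global_name, uploaded_file_names):
    """One key per file: globalName-uploadedFileName, de-duplicated with -1, -2, ..."""
    keys, seen = [], {}
    for file_name in uploaded_file_names:
        key = f"{global_name}-{file_name}"
        if key in seen:                       # duplicate file name: append a counter suffix
            seen[key] += 1
            key = f"{key}-{seen[key]}"
        else:
            seen[key] = 0
        keys.append(key)
    return keys

# build_upload_keys("project", ["a.pdf", "a.pdf", "b.pdf"])
# -> ["project-a.pdf", "project-a.pdf-1", "project-b.pdf"]
```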
PDFs
The best way to upload PDFs made from docx is just to print and/or combine/print them locally, on Windows with PDF gear.
Users can upload one or more PDF files to the uploads server via the UI; users can drag and drop multiple PDF files onto the upload button. The following steps occur:
- A modal dialog prompts the user to type a name for the file (the global file name) and/or select from an existing file (from the files table in the database). A toggle button allows the user to choose whether to add (append) or replace rows.
- If the provided global file name already exists, the user may use the checkbox to overwrite the existing file.
Mupdf is used to extract text from each PDF. One row is created per extracted page and inserted into a table with the following columns:
- text: The extracted text from the page.
- file: The global file name provided by the user.
- source_id: The name of the original PDF file from which the page was extracted. If the combine toggle is enabled, this is updated to globalFileName-uploadedFileName.
- statement_id: A serial number as a character string, zero-padded (e.g. 0001, 0002, etc.) based on the number of pages in the file.
- modified: The current timestamp (format: YYYY-MM-DD HH-MM-SS.ms).
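A minimal sketch of the per-page extraction, assuming PyMuPDF (the Python binding for mupdf, imported as fitz); the app's real column handling may differ:

```python
import datetime
import fitz  # PyMuPDF

def pdf_to_rows(pdf_path, global_file_name):
    """One row per extracted page, with the columns described above."""
    now = datetime.datetime.now().strftime("%Y-%m-%d %H-%M-%S.%f")[:-3]
    rows = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc, start=1):
            rows.append({
                "text": page.get_text(),        # extracted page text
                "file": global_file_name,       # the global file name from the modal
                "source_id": pdf_path,          # name of the original PDF file
                "statement_id": f"{i:04d}",     # zero-padded serial number
                "modified": now,                # current timestamp
            })
    return rows
```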
Excel Files (xlsx) and .csv
The above process is extended to Excel (.xlsx, .xls) files. Data is extracted from each sheet and processed in a similar manner to PDF files.
PDFs and spreadsheet files can be combined into a single upload.
Text (.txt) and RTF files (.rtf)
NEW: offer RTF files as an option, and if files are rtf, ask for a regex delimiter between statements and whether the matched text should be included at the start of the statement or the end of the statement. Offer previous delimiters (from local storage) also as options.
Authorisation
users can only overwrite files if they are owner or editor of the file
Caching
We have several different caching mechanisms.
Browser tabs: note that the same user with the same browser can get crosstalk between tabs of the same browser.
LocalCache class (local in-memory cache):
1. Line-based Caching:
- Storage: set_local_cached_result(lines, result)
  - Creates a key from filtered command lines
  - Stores results with copy semantics for DataFrames
  - Used during command processing to cache intermediate results
- Retrieval: get_local_cached_result(lines)
  - Returns cached results along with the number of cached lines
  - Ignores error results that start with "Error:"
  - Returns (0, None) if no cache match is found
2. Named Result Caching:
- Storage: set_local_named_result(name, result, sess_id) (NEW: set_local_named_result(name, result, projectname))
  - Handles DataFrames specially, preventing empty DataFrame storage
  - Used for the store command functionality
  - Names are composed like this: named/name_of_current_project/name
so if I do
store foo
in a project called say foldername|myproject-v87
the actual name will be named/foldername|myproject/foo. Note we strip the version number.
the "named/" prefix is useful when we also cache these at S3.
Note that project names may have one or more optional pipe characters. This is irrelevant here.
(note that previously the key used to include just name and the session ID, to avoid crosstalk between sessions using simple, ambiguous names like say "links", but we are not using that any more, as these names are now unique).
At the moment we rely on user to keep these names unique within a project.
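A minimal sketch of how the named-result key is composed (illustrative only):

```python
import re

def named_result_key(name, project_name):
    """Compose named/<project-without-version>/<name>, stripping any trailing -vNN."""
    base_project = re.sub(r"-v\d+$", "", project_name)   # foldername|myproject-v87 -> foldername|myproject
    return f"named/{base_project}/{name}"

# named_result_key("foo", "foldername|myproject-v87") -> "named/foldername|myproject/foo"
```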
NEW: on gutterclick, provide a frontend check for whether the editor contains any duplicate lines of the form store foo. If there is more than one such line with the same foo, show a notifyError ("not recommended") and refuse to send the request to the backend.
when setting or getting local or remote named result cache, give a notify_info() with the full name of the object including projectname etc.
NEW: temporarily disable remote caching of named results: don't store or retrieve them. comment out or skip the relevant lines.
- Retrieval: get_local_named_result(name, projectname)
We now add remote backup to S3 exactly as we do for line-based caching. Also add an overwrite=true/false keyword to the store command, case insensitive for true and false, default true. If overwrite is false and there is an existing result with the same key, the attempt will fail with a notify_error().
So, as usual, if the user stores a named result, we store it locally in the background and also remotely in the background. If the user recalls (aka unstores) a named result and it is available locally, we fetch it locally; otherwise we check the remote copy (note the NEW item above temporarily disables remote caching of named results).
3. Cache Management:
- Clearing: the clear() method empties the local cache and triggers garbage collection
- REST API: via the create_local_cache_blueprint() function, which creates a route for clearing the cache
Usage in the Application
The main usage is in process_parse_request() in app.py, which:
- Retrieves cached results for the current set of command lines
- Falls back to computation when there's no cached result
- Stores results after processing each line
- Handles explicit storage via the store and recall commands
Parsing for line-based caching (NEW: also ensuring relevant named results are loaded)
When parsing the input text in my app, the input for each line is the result of the previous line.
The local LocalCache is used specifically for caching intermediate results during parsing operations. This is useful because:
- It caches intermediate results during app use
- It handles named results (the store/recall command functionality)
When the user sends a line with a gutterclick, we are really sending a block: this last line with its commands, along with all preceding relevant lines of text, one command per line. To process this block, we first use discard_lines() to clean the text of comments etc. and then delete sections of text which are not ancestors, resulting in a set of commands (final_set) all of which are necessary to compute the result we need.
But now that we have final_set we can check the cache before running anything. We work back from the current line n, subtracting 1 each time, constructing a cache key from all the lines up to line m (checking the local cache, then the remote cache). As soon as we find a cached result at line m, we use that as the initial input and run the contents of the remaining block, lines (m+1) to n: first we run line m+1 on this result, then line m+2 on that result, etc., caching each result in the background (locally then remotely, and also caching named results). There is no need to check for cached results for lines (m+1) to n because we already know there aren't any.
Note that this caching of intermediate results just mentioned means that processing a line like store foo as part of a processing chain might mean caching it locally, then remotely, and then also additionally caching the named result e.g. as named/foldername|myproject/foo locally.
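A minimal sketch of the backward search for a cache hit (make_key, local_cache, remote_cache and run_line are stand-ins for the app's real functions):

```python
def run_block_with_cache(final_set):
    """Find the longest cached prefix, then run only the remaining lines."""
    n = len(final_set)
    result, m = None, 0
    for end in range(n, 0, -1):                 # work back from line n
        key = make_key(final_set[:end])         # key = all lines up to this point
        hit = local_cache.get(key)              # local first, then remote
        if hit is None:
            hit = remote_cache.get(key)
        if hit is not None:
            result, m = hit, end
            break
    for i in range(m, n):                       # run lines m+1 .. n on the cached result
        result = run_line(final_set[i], result)
        key = make_key(final_set[:i + 1])
        local_cache.set(key, result)            # cache each result locally, then remotely
        remote_cache.set(key, result)
    return result
```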
The enable_cache button (next to the clearCache button) is a toggle, default ON. When it is off, we disable get_cached_result (but not set_cached_result). So we run everything required from scratch, but the results are still saved in the other direction, to the local cache and to the remote cache, even if the key already exists.
Normally there is no reason not to use the cache, because the text uniquely defines its results. But there are a couple of exceptions:
- list uploads
- links xfile
- when the app code has been updated and maybe a visualisation looks different
How discard_lines() works
input texts may often have this kind of form (the numbering is for explication, they are not really numbered)
1. download xyz (or fetch, or get or scrape)
2. some processing
3. store abc
4. some other processing
5. unstore abc
6. some more processing
...
so if we send the text up to 6. some more processing to the parser, we can actually ignore line 4 (which may actually be a whole block, because it doesn't affect line 6). This means that we can update the existing input processing:
we already strip out
- empty lines,
- strip whitespace from line ends
- strip comments beginning # from middle of line, eg reduce command someparam etc # comment etc to command someparam etc
- replace double whitespace with single
- lines starting with a comment i.e. with # or - or >
- fenced blocks (blocks beginning and ending with ```)
and in addition, divide the resulting text into "blocks". Each block begins with the first line and also with any of these "block-starters":
- download
- fetch
- scrape
- get
- links
- statements
- sources
- unstore
- recall
- list
- share
- unshare
First identify the last block (the one including the current last line) and apply to it the "needs block check", that means, identify within it any "NEEDY" foo, bar etc in these NEEDY contexts:
NORMAL NEEDY
- unstore foo
- recall foo
NAMED NEEDY
- prompt instructions=foo
- join foo
- append foo
- relabel links=foo
- magnetise table=foo
- join table=foo
- groupify(foo,bar...)
- append table=foo
- append text=foo
- print ... highlights=foo [[NEW!!]]
And add them to a list called need_list. NEW: ONLY IF WE DON'T ALREADY HAVE foo AS A NAMED RESULT IN LOCAL OR REMOTE CACHE. IF IT IS ALREADY ACCESSIBLE AS A NAMED RESULT, DON'T ADD IT TO THE LIST.
The point about NEEDY commands is they have two (or more) inputs; the preceding result plus one or more named results.
Then, scan the blocks in reverse (from the current, last block upwards) through each block which is NOT the last block (the one including the current last line).
For each block:
- if it does NOT contain any line of the form: store baz where baz is in need_list, delete it.
- if it does,
- note the hitlist of foos in your needs list for which there are lines within the block of form store foo
- identify THE LAST_LINE WHICH IS LIKE store baz where baz is in your hitlist (there may be multiple store xxx matches in your needs list, so there may be one or more other relevant store xxx lines above it in the same block),
- PRESERVE the lines within this block from the top of it up to and including this LAST_LINE, BUT DELETE ANY LINES AFTER THAT.
- SUBTRACT all the names in your hitlist from the needs_list.
- ADD TO YOUR needs_list any more NEEDY foos in the remaining block (see definition of NEEDY above) NEW: ONLY ADD IT IF THE NAMED RESULT IS NOT ALREADY ACCESSIBLE IN LOCAL OR REMOTE CACHE
- Then repeat this scan with the next block above and so on until you reach the top of the file.
This will result in a new version of the input text, with one or more blocks or block sections discarded. The point of this is to delete any code which we do not need in order to completely run all the code in the current block.
Use this version of the input text.
HOWEVER: if we are running a line n with a command in the NAMED NEEDY list which NEEDS an additional named result, and we find a cache hit for line m and proceed to run lines (m+1) to n one by one, then when we come to line n (or even before) we may not be able to run it because we may not yet have the named result foo. SO we must also find out which named results are still NEEDED but are created in lines 1 to m, identify the lines b, c, d with the store commands where they are created, and run those lines b, c, d first, before running lines (m+1) to n, so that the named results are stored and available when we need them.
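A simplified sketch of the reverse block scan (illustrative Python; the real discard_lines() handles more NEEDY contexts, e.g. groupify(foo,bar), and skips names that are already available as cached named results):

```python
import re

STORE_RE = re.compile(r"store\s+(\S+)")
NEEDY_RE = re.compile(
    r"(?:unstore|recall)\s+(\S+)|(?:instructions|table|links|text|highlights)=(\S+)")

def needy_names(lines):
    """Names these lines still need from earlier store commands."""
    return {g1 or g2 for line in lines for g1, g2 in NEEDY_RE.findall(line)}

def discard_blocks(blocks):
    """Keep the last block whole; keep earlier blocks only up to the last relevant store line."""
    kept = [blocks[-1]]
    need = needy_names(blocks[-1])
    for block in reversed(blocks[:-1]):
        stores = {i: m.group(1) for i, line in enumerate(block)
                  if (m := STORE_RE.match(line))}
        hits = need & set(stores.values())
        if not hits:
            continue                               # stores nothing we need: discard the block
        last = max(i for i, name in stores.items() if name in hits)
        kept.insert(0, block[:last + 1])           # keep up to and including that store line
        need = (need - hits) | needy_names(block[:last + 1])
    return [line for block in kept for line in block]
```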
RemoteLinesCache: new backup to line-based caching
The application now has an S3 store which backs up LocalCache (line-based) results so they persist across sessions:
- When saving a (line-based) result to LocalCache, we automatically check if it exists in the S3 remote cache. If not, we save it.
- When looking for a result in LocalCache, if there's a miss, we automatically check the S3 remote cache before executing the code. If found in S3, we add it to LocalCache and return the result.
Implementation details:
- Results are stored in a cache/lines_cache/ subdirectory within the S3 bucket
- DataFrames are efficiently serialized to Parquet format
- Other objects use Python's pickle serialization
- The cache is key-based, with keys derived from the command lines
- Cache operations include exponential notification throttling to avoid UI clutter
- A new API endpoint /api/cache_status provides statistics about both local and remote caches
- Clear options are available via:
- /api/clear_local_cache - Clears only the in-memory local cache
- /api/clear_remote_cache - Clears only the S3 remote cache
- /api/clear_all_caches - Clears both local and remote caches
No changes to the main application code were required - the functionality was implemented by enhancing the LocalCache.set_local_cached_result() and LocalCache.get_local_cached_result() methods to integrate with the new RemoteLinesCache class.
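A minimal sketch of the S3 serialization logic (boto3, the bucket name and the key suffixes are illustrative assumptions):

```python
import io
import pickle
import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "causal-map-cache"               # assumed bucket name
PREFIX = "cache/lines_cache/"

def remote_set(key, result):
    """Store a line-cache result in S3: Parquet for DataFrames, pickle otherwise."""
    buf = io.BytesIO()
    if isinstance(result, pd.DataFrame):
        result.to_parquet(buf)            # efficient DataFrame serialization
        suffix = ".parquet"
    else:
        pickle.dump(result, buf)
        suffix = ".pkl"
    s3.put_object(Bucket=BUCKET, Key=PREFIX + key + suffix, Body=buf.getvalue())
```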
remote_prompt_cache (MongoDB persistent cache) with S3 offloading:
This is a NoSQL MongoDB cache suitable for long-term caching of results, especially the results of prompt processing, which may be expensive. We don't make much use of MongoDB's ability to do nested searches, except for the embeddings cache, and could just use S3 for most of it.
When the cache is full, it sends the oldest entries first to S3, from where they can be retrieved if necessary.
upload.py: uploaded results of PDF files.
Not strictly a cache but a store of the results of uploading PDF files. Doesn't really need to be in MongoDB.
remote_prompt_cache at MongoDB: RemotePromptCache: OpenAI response aka prompt cache for whole prompt command
THIS IS DISABLED AT THE MOMENT.
Entire prompt commands are cached in mongoDB. However, this is a monolithic approach. So if we have a prompt command which is chunked into multiple calls by, say statement_id 1 to 10, then we cache the entire command results. But within the prompt loop, we also cache the results of individual calls to openAI as follows.
remote_prompt_cache at MongoDB: OpenAI response aka prompt cache for individual prompt calls
Within the prompt loop, also cache the results of individual calls to openAI. So if prompt caching finds the results of the entire prompt command, use that; but if not, within the loop, check to see whether the results of each call have been cached and if so use them; and if not, call the API but then cache the individual results to mongodb (as well as, when the command is complete, caching the entire result to mongodb as usual). The current filename is also cached with each call.
HOWEVER: Even if a call fails, we cache the results of that call as a NULL result rather than simply skipping it.
There is a setting 'overwrite' in the prompt command. Assuming that there are already some NULL results in the MongoDB:
- If overwrite=failed, then we identify only the individual calls that failed, and attempt to regenerate only those.
- If it is 'all', then we regenerate all the calls.
- The default setting is 'none', which means we use all the cached results including failed ones (which are empty/NULL). We reprocess only the calls that have no result at all cached, and cache the results of those as usual.
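A minimal sketch of the per-call decision this implies (names are illustrative):

```python
_MISSING = object()   # no cache entry at all, as opposed to a cached NULL/failed result

def should_call_api(cached=_MISSING, overwrite="none"):
    """True if this individual prompt call must be (re)sent to the API."""
    if overwrite == "all":
        return True                        # regenerate every call
    if cached is _MISSING:
        return True                        # never cached: always process
    if overwrite == "failed":
        return cached is None              # re-run only the cached NULL (failed) results
    return False                           # 'none': use every cached result, even NULLs
```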
Memory Usage Indicator
The memory usage indicator, located in the top right of the UI, shows the current memory usage of the Python process.
Clicking the "Check Memory" button sends a request to the /api/memory-usage endpoint, which uses the psutil library to retrieve the current memory usage. This is then displayed next to the button.
Clicking the "Clear Cache" button not only clears the application cache, but also triggers garbage collection using gc.collect() to attempt to free up memory. After clearing the cache, it automatically updates the memory usage display.
This allows monitoring the memory footprint of the application and taking action if it grows too large.
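A minimal sketch of what the /api/memory-usage endpoint might look like (Flask route shown for illustration; the real handler may differ):

```python
import os
import psutil
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/memory-usage")
def memory_usage():
    """Report the resident memory of the current Python process in MB."""
    rss = psutil.Process(os.getpid()).memory_info().rss
    return jsonify({"memory_mb": round(rss / 1024 / 1024, 1)})
```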
technical overview
The app's workflow is:
- Frontend sends input to /api/parse endpoint
- Backend immediately returns a 202 status with "processing" message, then starts background task
- Backend uses SSE to:
- Send status updates during processing
- Send final results when complete
For results:
- Network graphs/visualizations are sent directly in SSE message
- Tables send first 50 rows in SSE message, with pagination info
- Frontend can request more rows via pagination API
Frontend handles results by:
- Displaying immediate visualizations
- Setting up Tabulator table with client-side pagination for datasets
- Showing any error messages
Global vs User-specific caching
We don't need user-specific caching of results because all results are keyed with an input string and the result is always a strict function of that input string.
However:
- Notifications need to be user-specific as they are state dependent
- Data_store needs to be user-specific as it is not keyed by the input string
My app gets a unique firebase_user_id from firebase. This should just be used for authentication and login/out. It also generates a unique session-based token called sess_id for data stores, notifications etc. @main.js but then in parser.js it seems like the firebase auth is happening all over again? Can't this use the auth from main.js?
Session Management and Connection Robustness
- The app uses Server-Sent Events (SSE) for real-time communication between frontend and backend.
- The frontend automatically detects if the SSE connection is lost (e.g., due to network issues or backend restart). If no message is received for 30 seconds, the user is notified and the connection is automatically re-established.
- If the backend cannot deliver a message (e.g., due to a dropped connection), it stores the message and delivers it as soon as the user reconnects.
- Each session has a maximum lifetime of 12 hours and an idle timeout of 10 minutes. If either is exceeded, the backend cleans up all resources for that session and notifies the user with an error message ("Your session has expired due to inactivity/maximum session lifetime. Please reload the page.").
- This ensures memory is reclaimed on the server and users are always kept informed about their session status.