Data labeling

V7 & Databricks: How to Turbocharge Your AI & MLOps

5 min read

Jun 13, 2023

We are proud to announce the partnership of V7 and Databricks! Transform your unstructured data into training data and ground truth quickly and accurately thanks to V7’s integration into the Databricks platform.

V7 Team

We are proud to announce the partnership of V7 and Databricks! 

V7’s integration into the Databricks platform will allow customers to transform their unstructured data into training data and ground truth quickly and accurately. V7 fills a relevant and growing need for Databricks’ customers, with a particular focus on the Healthcare and Industrials verticals and unique video and workflow management features.

The collaboration will enable customers to create a dataset in Databricks, annotate the data in V7 and then load the annotations back into Databricks for easy querying and model training. V7 will also support pre-annotations performed in Databricks, which you’ll be able to push into the platform. In the future, you’ll also be able to connect Databricks models to V7.

Combining the powers of both tools will enable customers to better manage machine learning workflows and their underlying data—at the scale required to build better AI products.

Solving AI development challenges together

Preparing large amounts of data for large-scale machine learning projects is often challenging. Paired with concerns about security, tiresome data transfers, and annotation efforts, it can significantly slow product delivery time.

This partnership aims to solve these problems, helping joint customers manage data securely and efficiently, and build powerful machine learning workflows—ensuring maximum quality of training data and significantly speeding up time-to-market.

Who are V7 and Databricks?

V7 is the AI data engine enabling better AI products to reach the market faster, with reduced costs and less risk. Its key features include auto-labeling, dataset management, customizable workflows, and the highest security standards. 

Used by enterprise customers worldwide, V7’s unique workflows enable 10x faster labeling. The platform balances automation with human input in a beautiful and intuitive UX, allowing customers to transform raw data into value and innovate with speed and accuracy.

Databricks is a data and AI company. The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance, and performance of data warehouses with the openness, flexibility, and machine learning support of data lakes.

Built on an open lakehouse architecture, AI and Machine Learning on Databricks empowers ML teams to prepare and process data, streamlines cross-team collaboration and standardizes the full ML lifecycle from experimentation to production, including for generative AI and large language models.

Key benefits of V7 & Databricks partnership

Together, these two powerful and fast-growing players in the data and machine learning world deliver fresh capabilities to their customers. Here’s what joint customers can gain:

  • Smoother data management experience. Combining V7 and Databricks lets you easily create workflows where data transfer and labeling take only a few lines of code. All unstructured data is easily manageable via the PySpark library and V7 API.

  • Support for all data formats. Whether you need traditional images, videos, or medical imagery, V7 will handle all your data in its native format. Notably, V7 supports DICOM, NIfTI, and WSI imagery, making it the right choice for healthcare and life sciences AI. 

  • Video specialism. Video streaming is a frequent problem for computer vision tasks that V7 has strategically addressed by building a best-in-class streaming, video timeline, and model integration offering.

  • Robust integration ecosystem. The V7 API, Python SDK, or CLI lets you easily extend your data workflows. You can host your data in enterprise cloud storage, connect with other MLOps platforms, or seamlessly integrate the annotated data into any machine learning framework.

  • Full data safety. V7 is SOC 2, HIPAA, and FDA Part 11 compliant.

  • Vertical expertise. V7 serves all markets and regions, but it has particularly deep experience and customer success in the Healthcare & Life Sciences and Industrials & Manufacturing verticals.

See how V7 helped InformAI build an organ volume estimation model that achieves 97% accuracy.

How to use V7 + Databricks

Take a look at this quick video introducing the integration.

And here’s a reference architecture diagram that showcases how the interaction works:

The partnership enables joint users to fit V7 and Databricks together seamlessly, and to slot both into their wider technical stack.

Customers can use Databricks’ Delta Lake features, such as ACID transactions and Parquet-based columnar storage. They can then send this data to and from V7 effortlessly and securely.

This is possible thanks to our darwinpyspark library, a wrapper around the V7 API that lets users:

  • Upload data from a PySpark DataFrame to V7

  • Download data from V7 and load it into a PySpark DataFrame

  • Handle data registration, uploading, and confirmation with V7

  • Efficiently manage large datasets and data exports

Let’s go through a quick rundown on how to set it up.

Installation

pip install darwinpyspark

Usage

This framework is designed to be used alongside the darwin-py Python SDK. You can find darwin-py examples in the V7 docs.

To get started with DarwinPyspark, you'll first need to create a DarwinPyspark instance with your V7 API key, team slug, and dataset slug:

from darwinpyspark import DarwinPyspark

API_KEY = "your_api_key"
team_slug = "your_team_slug"
dataset_slug = "your_dataset_slug"

dp = DarwinPyspark(API_KEY, team_slug, dataset_slug)
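For shared notebooks, you may prefer not to hardcode credentials. As a minimal sketch (the environment variable names below are our own choice, not something darwinpyspark mandates), the setup values can be pulled from the environment instead:

```python
import os

def v7_credentials(env=os.environ):
    """Read V7 credentials from an environment mapping; fail loudly if any is missing."""
    try:
        return env["V7_API_KEY"], env["V7_TEAM_SLUG"], env["V7_DATASET_SLUG"]
    except KeyError as err:
        raise RuntimeError(f"missing environment variable: {err.args[0]}") from err

# API_KEY, team_slug, dataset_slug = v7_credentials()
# dp = DarwinPyspark(API_KEY, team_slug, dataset_slug)
```

Passing the mapping in explicitly also makes the helper easy to test without touching the real environment.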

Uploading data

To upload a PySpark DataFrame to V7, use the upload_items method:

# Assume `df` is your PySpark DataFrame with columns 'object_url' and 'file_name'
dp.upload_items(df)

The upload_items method takes a PySpark DataFrame with columns 'object_url' (an openly accessible or pre-signed URL for the image) and 'file_name' (the name you want the file to be listed under in V7).
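To build those two columns from a list of URLs, a small helper (our own, not part of darwinpyspark) can derive 'file_name' from each 'object_url' so every file keeps its original name in V7:

```python
from urllib.parse import urlparse
import posixpath

def file_name_from_url(object_url):
    """Return the last path component of a URL, e.g. 'cell_001.png'."""
    return posixpath.basename(urlparse(object_url).path)

urls = [
    "https://example.com/images/cell_001.png",
    "https://example.com/images/cell_002.png?X-Amz-Signature=abc",  # pre-signed URL
]
rows = [(u, file_name_from_url(u)) for u in urls]

# df = spark.createDataFrame(rows, ["object_url", "file_name"])
# dp.upload_items(df)
```

Note that the query string of a pre-signed URL is ignored, so the derived name stays clean.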

Managing data in V7

V7 lets you annotate files such as images, videos, or DICOM, train and test models, set up custom ML workflows, and more.

If you’re new to the tool, take a look at this quick rundown:

Or jump to the documentation to see how V7 can help your specific use case! You can also check the GitHub repo or our PyPI page for more information.

Downloading data

To download data from V7 as a PySpark DataFrame, use the download_export method:

export_name = "your_export_name"
export_df = dp.download_export(export_name)
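The exact schema of the export DataFrame depends on your export configuration. As a hedged sketch, suppose each record carries its raw Darwin-style JSON as a string; a small helper can then pull out annotation class names for a quick sanity check (the 'json' column name below is an assumption to adapt to your schema):

```python
import json

def class_names(export_json):
    """Return the annotation class names from one Darwin-style export record."""
    record = json.loads(export_json)
    return [a.get("name", "") for a in record.get("annotations", [])]

sample = '{"annotations": [{"name": "Basophil Cell"}, {"name": "Eosinophil"}]}'
print(class_names(sample))  # -> ['Basophil Cell', 'Eosinophil']

# Applied across the export, adapting the column name to your schema:
# names = [class_names(row.json) for row in export_df.collect()]
```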

What’s next?

We are delighted to have this partnership live, and we look forward to adding value to your AI projects.

Do you have any feedback, comments, or questions regarding the integration? Let us know. We’d like to hear from you and we promise to respond quickly to any cool idea!

Keep an eye out for upcoming webinars and materials by following V7 on LinkedIn.

We look forward to helping you transform your data into software. Connect with us on LinkedIn or book a chat using the links below:

Next steps

Label videos with V7.

Rewind less, achieve more.

Try our free tier or talk to one of our experts.
