Download StreamSets
Author: m | 2025-04-25
There are two ways to get the StreamSets Data Collector engine:

Option A: pull the StreamSets Data Collector Docker image, which is available from within the StreamSets Data Integration Platform.

Option B: download the StreamSets Data Collector engine tarball.
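As a concrete sketch of the two options, assuming the publicly published streamsets/datacollector image on Docker Hub and a tarball laid out like the open source distribution (the exact image tag and download URL vary by version, so copy them from within the platform):

$ # Option A: pull and run the Data Collector Docker image
$ # (image name and tag are assumptions; use the ones shown in the platform)
$ docker pull streamsets/datacollector:latest
$ docker run -d --name sdc -p 18630:18630 streamsets/datacollector:latest

$ # Option B: download and extract the engine tarball
$ # (<version> and the URL are placeholders; copy the real link from the platform)
$ curl -O https://archives.streamsets.com/datacollector/<version>/tarball/streamsets-datacollector-core-<version>.tgz
$ tar xzf streamsets-datacollector-core-<version>.tgz
$ cd streamsets-datacollector-<version>
$ bin/streamsets dc

With the Docker option, the Data Collector UI is then reachable on port 18630; with the tarball, bin/streamsets dc starts the engine in the foreground.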
Creating a Custom Processor for StreamSets Transformer

StreamSets Transformer combines the power of Apache Spark with the ease of use of StreamSets' award-winning Data Collector. You can build dataflow pipelines that aggregate data across time, join incoming streams of data, and scale across your Spark cluster. Processors are the core of Transformer's pipelines, implementing operations on Spark DataFrames.

This tutorial explains how to create a simple custom processor, using Java and Scala, that computes the type of a credit card from its number, and how to configure Transformer to use it.

Prerequisites

- Download and install StreamSets Transformer.
- Oracle Java Development Kit (JDK) 1.8 or later, to compile Java code and build JAR files.
- Scala 2.10 or later.
- Maven 3.3.9 or later, to manage the JAR file build process.

Transformer includes the Spark libraries required to preview dataflow pipelines.
You will need an Apache Spark 2.3 (or higher) distribution to run the pipeline.

Implementing a Skeleton Processor

The main class of the processor is written in Scala and extends the com.streamsets.datatransformer.api.spark.SingleInputSparkTransform abstract class, implementing the transform(input: SparkData): SparkData method and, optionally, the init(): util.List[ConfigIssue] and destroy() methods.

Here's a minimal implementation that simply returns its input as its output:

package com.example.processor.sample

import java.util

import com.streamsets.datatransformer.api.spark.SingleInputSparkTransform
import com.streamsets.pipeline.api.ConfigIssue
import com.streamsets.pipeline.spark.SparkData

/** Sample processor.
 *
 * @constructor create a new processor
 */
class SampleProcessor extends SingleInputSparkTransform {

  /**
   * Initializes the processor.
   *
   * @return a List of any [[com.streamsets.pipeline.api.ConfigIssue]]s found by the superclass constructor
   */
  override def init(): util.List[ConfigIssue] = {
    val issues = super.init()
    if (issues.size() == 0) {
      // Perform any initialization
    }
    issues
  }

  /**
   * Transforms the input [[com.streamsets.pipeline.spark.SparkData]] into the
   * output.
   *
   * @param input [[com.streamsets.pipeline.spark.SparkData]] containing a [[org.apache.spark.sql.DataFrame]]
   * @return output [[com.streamsets.pipeline.spark.SparkData]] containing output data
   */
  override def transform(input: SparkData): SparkData = {
    val df = input.get()

    // Short circuit if no incoming data
    if (df.count() == 0) return input

    // Apply required operations on the DataFrame before returning it in a
    // new SparkData object
    new SparkData(
      df
    )
  }
}

Clone the sample processor git repository, check out the skeleton tag, and examine the above code there. You will also see a couple of supporting Java classes and a default icon for the processor. We'll look at those more closely later.

Now build the project with mvn clean package:

$ mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] ----------------------
[INFO] Building StreamSets Example Processor 3.9.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
...output omitted...
[INFO] Building jar: /Users/pat/src/custom_processor/target/streamsets-example-processor-lib-3.9.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.370 s
[INFO] Finished at: 2019-04-02T09:21:40-07:00
[INFO] ------------------------------------------------------------------------

You should see a jar file in the target directory:

$ ls target
analysis            maven-status
classes             streamsets-example-processor-lib-3.9.0-SNAPSHOT.jar
generated-sources   surefire-reports
maven-archiver      test-classes

Copy the jar file, streamsets-example-processor-lib-3.9.0-SNAPSHOT.jar, to Transformer's api-lib directory.
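The finished tutorial fills in the transform() body with the credit-card-type computation. As a minimal sketch of the kind of DataFrame operation involved, the snippet below derives a card type from the number's leading digits; the column names (credit_card, credit_card_type) and the prefix rules are illustrative assumptions, not the tutorial's exact code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object CardTypeDemo {

  // Derive a card type from the leading digits of the card number.
  // The prefix rules (4 = Visa, 51-55 = Mastercard, 34/37 = AMEX) are
  // illustrative, not an exhaustive industry rule set.
  def withCardType(df: DataFrame): DataFrame =
    df.withColumn(
      "credit_card_type", // assumed output column name
      when(col("credit_card").startsWith("4"), lit("Visa"))
        .when(col("credit_card").rlike("^5[1-5]"), lit("Mastercard"))
        .when(col("credit_card").rlike("^3[47]"), lit("AMEX"))
        .otherwise(lit("Other"))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CardTypeDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // "credit_card" is an assumed input column name
    val df = Seq("4111111111111111", "5500005555555559", "340000000000009")
      .toDF("credit_card")

    withCardType(df).show(false)
    spark.stop()
  }
}

Inside the processor, transform() would wrap the same logic, returning new SparkData(withCardType(input.get())).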
Registering IBM StreamSets as a Service Provider in AD FS

To register IBM StreamSets as a service provider in AD FS, use the IdP information that you retrieved from Control Hub to create a relying party trust in AD FS. Then configure a claims issuance policy for the trust to send the email addresses, and optionally the user names, of Active Directory Domain Services (AD DS) users to IBM StreamSets. Any user in AD DS can log in to IBM StreamSets, as long as the user is invited to the Control Hub organization using the AD DS email address.

Note: These steps provide brief instructions to create a relying party trust using the AD FS Management tool installed on Windows Server 2019. For detailed steps, see the Microsoft AD FS documentation.

1. Open Server Manager on the server that is running AD FS, and then click Tools > AD FS Management.
2. Right-click the Relying Party Trusts folder, and then select Add Relying Party Trust.
3. In the Welcome page of the wizard, select Claims aware, and then click Start.
4. In the Select Data Source page, select Import data about the relying party from a file, click Browse, select the metadata XML file that you downloaded from Control Hub, and then click Next.
5. In the Specify Display Name page, enter a display name. For example, you might enter StreamSets SAML. Click Next.
6. In the Choose Access Control Policy page, choose the policy required by your corporate regulations, and then click Next.
7. In the Ready to Add Trust page, verify your configuration, and then click Next.
8. In the Finish page, select Configure claims issuance policy for the application, and then click Close. The Edit Claims Issuance Policy dialog box appears.
9. Click Add Rule.
10. In the Choose Rule Type page of the claim rule wizard, select Send LDAP Attributes as Claims for the Claim rule template property, and then click Next.
11. In the Configure Claim Rule page, enter a name for the rule. For example, you might enter StreamSets Attribute Mappings. For the Attribute store property, select Active Directory. In the Mappings table,
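The source text cuts off at the mapping step, so the mappings themselves are not shown here. For orientation only, the Send LDAP Attributes as Claims template generates a rule in AD FS's claim rule language; the following is a hedged sketch of what such a rule might look like when the LDAP mail attribute is mapped to the outgoing E-Mail Address claim (verify the exact mappings required by IBM StreamSets against its documentation):

c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
 => issue(store = "Active Directory",
          types = ("http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress"),
          query = ";mail;{0}", param = c.Value);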