encode_utils.connection

encode_utils.connection.LOG_DIR = 'EU_Logs'

The directory that contains the log files created by the Connection class.

class encode_utils.connection.Connection(dcc_mode=None, dry_run=False, submission=False, no_log_file=False)[source]

Bases: object

Handles communication with the Portal regarding data submission and retrieval.

For data submission or modification, and working with non-released datasets, you must have the environment variables DCC_API_KEY and DCC_SECRET_KEY set. Check with your DCC data wrangler if you haven’t been assigned these keys.

There are three log files opened in append mode in the directory specified by connection.LOG_DIR that are specific to whichever Portal you are connected to. When connected to Production, each log file name will include the token ‘_prod_’. For Development, the token will be ‘_dev_’. The three log files are named accordingly in reference to their purpose, and are classified as:

  1. debug log file - All messages sent to STDOUT are also written to this log file. In addition, all messages written to the error log file described below are logged here.
  2. error log file - Only terse error messages are sent to this log file for quick scanning of any potential issues. If you identify an error here that needs more explanation, then you should consult the debug log file.
  3. posted log file - Tabulates what was successfully POSTED. There are three tab-delimited columns ordered as submission timestamp, record alias, and record accession (or UUID if the accession property doesn’t exist for the profile of the record at hand). Note that if a record has several aliases, then only the first one in the list for the aliases property is used.
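
As a minimal sketch of connecting (assuming the DCC_API_KEY and DCC_SECRET_KEY environment variables are set; the mode value is illustrative):

import encode_utils.connection as euc

conn = euc.Connection(dcc_mode="dev")  # or "prod", or an explicit host
# The three log files are opened in append mode under euc.LOG_DIR,
# each with the token '_dev_' in its name.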
ENCID_KEY = '_enc_id'

Identifies the name of the key in the payload that stores a valid ENCODE-assigned identifier for a record, such as alias, accession, uuid, md5sum, … depending on the object being submitted. This is not a valid property of any ENCODE object schema, and is used in the patch() instance method to designate the record to update.

PROFILE_KEY = '_profile'

Identifies the name of the key in the payload that stores the ID of the profile to submit to. Like ENCID_KEY, this is a non-schematic key that is used only internally.
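
For illustration, a PATCH payload designates its target record via ENCID_KEY, while a POST payload names its profile via PROFILE_KEY. A minimal sketch (the accession, alias, and property values are hypothetical):

from encode_utils.connection import Connection

patch_payload = {
    Connection.ENCID_KEY: "ENCBS123ABC",  # hypothetical accession of the record to PATCH
    "description": "An updated description",
}
post_payload = {
    Connection.PROFILE_KEY: "biosample",  # the profile to POST to
    "aliases": ["michael-snyder:my-biosample"],  # hypothetical alias
    "description": "A new biosample",
}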

POST = 'post'

Constant

PATCH = 'patch'

Constant

debug_logger = None

A reference to the debug logging instance that was created earlier in encode_utils.debug_logger. This class adds a file handler, such that all messages sent to it are logged to this file in addition to STDOUT.

dcc_modes = None

An indication of which Portal instance to use. Set to ‘prod’ for the production Portal, and ‘dev’ for the development Portal. Alternatively, you can set an explicit host, such as demo.encodedcc.org. Leaving the default of None means to use the value of the DCC_MODE environment variable.

dry_run = None

Set to True to prevent any server-side changes on the ENCODE Portal, i.e. PUT, POST, PATCH, DELETE requests will not be sent to the Portal. After-POST and after-PATCH hooks (see the instance method after_submit_hooks()) will not be run either in this case. You can turn off this dry-run feature by calling the instance method set_live_run().
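
A minimal sketch of toggling the dry-run feature (payload is assumed to be defined elsewhere):

conn = Connection(dcc_mode="dev", dry_run=True)
conn.post(payload)   # Logged only; no request reaches the Portal.
conn.set_live_run()  # Disable the dry-run feature.
conn.post(payload)   # Now the POST is actually sent.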

error_logger = None

A logging instance with a file handler for logging terse error messages. The log file resides locally within the directory specified by the constant connection.LOG_DIR. Accepts messages >= logging.ERROR.

post_logger = None

A logging instance with a file handler for logging successful POST operations. The log file resides locally within the directory specified by the constant connection.LOG_DIR. Accepts messages >= logging.INFO.

auth

Sets the API and secret keys to use when authenticating with the DCC servers. These are determined from the values of the DCC_API_KEY and DCC_SECRET_KEY environment variables via the _get_api_keys_from_env() private instance method.

set_submission(status)[source]

Sets the boolean value of the self.submission attribute.

Parameters: status – bool.
check_dry_run()[source]

Checks if the dry-run feature is enabled, and if so, logs the fact. This is mainly meant to be called by other methods that are designed to make modifications on the ENCODE Portal.

Returns: True if the dry-run feature is enabled; False if the dry-run feature is turned off.
Return type: bool
set_dry_run()[source]

Enables the dry-run feature and logs the fact.

set_live_run()[source]

Disables the dry-run feature and logs the fact.

log_error(msg)[source]

Sends ‘msg’ to both self.error_logger and self.debug_logger.

add_alias_prefix(aliases, prefix=False)[source]

Given a list of aliases, adds the lab prefix to each one that doesn’t yet have a prefix set. The lab prefix is taken as the passed-in prefix, otherwise, it defaults to the DCC_LAB environment variable, and it must be a value assigned by the DCC, i.e. “michael-snyder” for the Snyder Production Center. The DCC requires that aliases be prefixed in this manner.

Parameters:
  • aliases – list of aliases.
  • prefix – str. The DCC-assigned lab prefix to use. If not specified, then the default is the value of the DCC_LAB environment variable.
Returns:

list.

Raises:

Exception – A passed-in alias doesn’t have a prefix set, and the default prefix could not be determined.

Examples:

add_alias_prefix(aliases=["my-alias"], prefix="michael-snyder")
# Returns ["michael-snyder:my-alias"]

add_alias_prefix(aliases=["michael-snyder:my-alias"], prefix="michael-snyder")
# Returns ["michael-snyder:my-alias"]

add_alias_prefix(aliases=["my_alias"], prefix="bad-value")
# Raises an Exception since this lab prefix isn't from a registered source record on
# the Portal.
get_aliases(dcc_id, strip_alias_prefix=False)[source]

Given an ENCODE identifier for an object, performs a GET request and extracts the aliases.

Parameters:
  • dcc_id – str. The ENCODE ID for a given object, i.e. ENCSR999EHG.
  • strip_alias_prefix – bool. True means to remove the alias prefix from all returned aliases.
Returns:

The aliases.

Return type:

list

indexing()[source]

Indicates whether the Portal is updating its schematic indices.

Returns: True if the Portal is indexing, False otherwise.
Return type: bool
make_search_url(search_args)[source]

Creates a URL encoded URL given the search arguments.

Parameters: search_args – list of two-item tuples of the form [(key, val), (key, val), ...].
Returns: The URL containing the URL encoded query.
Return type: str
search(search_args=[], url=None, limit=None)[source]

Searches the Portal using the provided query parameters, which will first be URL encoded. The user can pass in the query parameters and values via the search_args argument, or pass in a URL directly that contains a query string via the url argument, or provide values for both arguments in which case the query parameters specified in search_args will be added to the query parameters given in the URL.

If self.submission == True, then the query will be searched with “datastore=database”, unless the ‘database’ query parameter is already set.

Parameters:
  • search_args – list of two-item tuples of the form [(key, val), (key, val), ...]. To support a != style query, append “!” to the key name.
  • url – str. A URL used to search for records interactively in the ENCODE Portal. The query will be extracted from the URL.
  • limit – int. The number of search results to send from the server. The default means to return all results.
Returns:

The search results.

Return type:

list

Raises:

requests.exceptions.HTTPError – The status code is not ok and != 404.
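
A short sketch of a search (conn is a Connection instance; the query values are illustrative):

results = conn.search(
    search_args=[
        ("type", "Experiment"),
        ("assay_title", "ATAC-seq"),  # hypothetical filter
        ("status!", "released"),      # a != style query
    ],
    limit=25,
)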

get_profile_from_payload(payload)[source]

Useful to call when doing a POST (and self.post() does call this). Ensures that the profile key identified by self.PROFILE_KEY exists in the passed-in payload and that the value is a recognized ENCODE object profile (schema) identifier. Alternatively, the user can set the profile in the more convoluted @id property.

Parameters:

payload – dict. The intended object data to POST.

Returns:

The ID of the profile if all validations pass.

Return type:

str

Raises:
  • encode_utils.exceptions.ProfileNotSpecified – Both keys self.PROFILE_KEY and @id are missing in the payload.
  • encode_utils.profiles.UnknownProfile – The profile ID isn’t recognized by the class encode_utils.profiles.Profile.
get_lookup_ids_from_payload(payload)[source]

Given a payload to submit to the Portal, extracts the identifiers that can be used to lookup the record on the Portal, i.e. to see if the record already exists. Identifiers are extracted from the following fields:

  1. self.ENCID_KEY,
  2. aliases,
  3. md5sum (in the case of a file object)
Parameters: payload – dict. The data to submit.
Returns: The possible lookup identifiers.
Return type: list
get(rec_ids, database=False, ignore404=True, frame=None)[source]

GET a record from the Portal.

Looks up a record in the Portal and performs a GET request, returning the JSON serialization of the object. You supply a list of identifiers for a specific record, and the Portal will be searched for each identifier in turn until one is either found or the list is exhausted.

Parameters:
  • rec_ids – str or list. Must be a list if you want to supply more than one identifier. Example identifiers include a uuid, an accession, …, or even the value of a record’s @id property.
  • database – bool. If True, then search the database directly instead of the Elasticsearch indices. Always True when in submission mode (self.submission is True).
  • frame – str. A value for the frame query parameter, i.e. ‘object’, ‘edit’. See https://www.encodeproject.org/help/rest-api/ for details.
  • ignore404 – bool. Only matters when none of the passed-in record IDs are found on the Portal. In that case, if set to True, then an empty dict will be returned; if set to False, then an Exception will be raised.
Returns:

The JSON response. Will be empty if no record was found AND ignore404=True.

Return type:

dict

Raises:
  • Exception – If the server responds with a FORBIDDEN status.
  • requests.exceptions.HTTPError – The status code is not ok, and the cause isn’t due to a 404 (not found) status code when ignore404=True.
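
For example, a lookup by accession with a fallback alias (both identifiers hypothetical):

rec = conn.get(rec_ids=["ENCSR999EHG", "michael-snyder:my-alias"], frame="object")
if not rec:
    print("Record not found")  # With ignore404=True, a missing record yields an empty dict.
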
set_attachment(document)[source]

Sets the attachment property for any profile that supports it, such as document or antibody_characterization.

Checks if the provided file is an image in either JPEG or TIFF format - if so, then checks the image orientation in the EXIF data and rotates it if necessary. The original image will not be modified.

Parameters: document – str. A local file path.
Returns: The ‘attachment’ property value.
Return type: dict
after_submit_file_cloud_upload(rec_id, profile_id)[source]

An after-POST submit hook for uploading files to AWS.

Some objects, such as Files (file.json profile), need to have a corresponding file in the cloud. Where in the cloud the actual file should be uploaded is indicated in the File object’s file.upload_credentials.upload_url property. Once the File object is posted, this hook is used to perform the actual cloud upload of the physical, local file represented by the File object.

Parameters:
  • rec_id – str. An identifier for the new File object on the Portal.
  • profile_id – str. The ID of the profile that the record belongs to.
after_submit_hooks(rec_id, profile_id, method='', upload_file=True)[source]

Calls after-POST and after-PATCH hooks. This method is called from both the post() and patch() instance methods. Returns the None object immediately if the dry-run feature is enabled.

Some hooks only run if you are doing a PATCH, others if you are only doing a POST. Then there are some that run if you are doing either operation. Each hook that is called can potentially modify the payload.

Parameters:
  • rec_id – str. An identifier for the record on the Portal.
  • profile_id – str. The profile identifier indicating the profile that the record belongs to.
  • method – str. One of self.POST or self.PATCH, or the empty string, to indicate which registered hooks to look through.
  • upload_file – bool. If False, skip uploading files to the Portal. Defaults to True.
before_submit_alias(payload)[source]

A pre-POST and pre-PATCH hook used to:

  1. Clean alias names by removing disallowed characters indicated by the DCC schema for the alias property.
  2. Add the lab alias prefix to any aliases that are missing it. The DCC_LAB environment variable is consulted to fetch the lab name; if it is not set, then this step is a no-op.
Parameters: payload – dict. The payload to submit to the Portal.
Returns: The potentially modified payload.
Return type: dict
before_submit_attachment(payload)[source]

A pre-POST and pre-PATCH hook used to simplify the creation of an attachment in profiles that support it.

Checks the payload for the presence of the attachment property that is used by certain profiles, i.e. document and antibody_characterization, and then checks to see if a particular shortcut is being employed to indicate the attachment. That shortcut works as follows: if the dictionary value of the ‘attachment’ key has a key named ‘path’ in it (case-sensitive), then the value is taken to be the path to a local file, and the actual attachment object is constructed, as defined in the document profile, by calling self.set_attachment(). Note that this shortcut is particular to this Connection class. When used, the ‘path’ key should be the only key in the attachment dictionary, as any others will be ignored.

Parameters: payload – dict. The payload to submit to the Portal.
Returns: The potentially modified payload.
Return type: dict
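
For illustration, the shortcut looks like this in a payload (the alias and file path are hypothetical):

payload = {
    Connection.PROFILE_KEY: "document",
    "aliases": ["michael-snyder:my-document"],          # hypothetical alias
    "attachment": {"path": "/data/docs/protocol.pdf"},  # expanded by set_attachment()
}
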
before_post_file(payload)[source]

A pre-POST hook that calculates and sets the md5sum property and file_size property for a file record. However, either of these two properties that is already set in the payload to a non-empty value will not be reset.

Parameters: payload – dict. The payload to submit to the Portal.
Returns: The potentially modified payload.
Return type: dict
Raises: encode_utils.utils.MD5SumError – Percolated up from the function encode_utils.utils.calculate_md5sum when it can’t calculate the md5sum.
before_post_fastq_file(payload)[source]

A pre-POST hook for FASTQ file objects that checks whether certain rules are followed as defined in the file.json schema.

For example, if the FASTQ file is sequenced single-end, then the property File.run_type should be set to single-ended as expected; however, the property File.paired_end shouldn’t be set in the payload, as the File.run_type property has the comment:

Only paired-ended files should have paired_end values
before_submit_hooks(payload, method='')[source]

Calls pre-POST and pre-PATCH hooks. This method is called from both the post() and patch() instance methods.

Some hooks only run if you are doing a PATCH, others if you are only doing a POST. Then there are some that run if you are doing either operation. Each hook that is called can potentially modify the payload.

Parameters:
  • payload – dict. The payload to POST or PATCH.
  • method – str. One of “post” or “patch”, or the empty string, to indicate which registered hooks to call. Some hooks are agnostic to the HTTP method, and these hooks are always called. Setting method to the empty string means to only call these agnostic hooks.
Returns:

The potentially modified payload that has been passed through all applicable pre-submit hooks.

Return type:

dict

post(payload, require_aliases=True, upload_file=True, return_original_status_code=False, truncate_long_strings_in_payload_log=False)[source]

POST a record to the Portal.

Requires that you include in the payload the non-schematic key self.PROFILE_KEY to designate the name of the ENCODE object profile that you are submitting to, or the actual @id property itself.

If the lab property isn’t present in the payload, then the default will be set to the value of the DCC_LAB environment variable. Similarly, if the award property isn’t present, then the default will be set to the value of the DCC_AWARD environment variable.

Before the POST is attempted, any pre-POST hooks are first called; see the method self.before_submit_hooks(). After a successful POST, any after-POST submit hooks are also run; see the method self.after_submit_hooks().

Parameters:
  • payload – dict. The data to submit.
  • require_aliases – bool. True means that the ‘aliases’ property is required in the payload. This is the default, and it is highly recommended not to change it, because it is easy to create duplicates on the server when accidentally POSTing the same payload again. For example, you can easily create the same biosample as many times as you want on the Portal when not providing an alias. Furthermore, submitting labs should include at least one alias per record being submitted to the Portal for traceability purposes in the submitting lab.
  • upload_file – bool. If False, then when POSTing files the file data will not be uploaded to S3; defaults to True. This can be useful if you have custom upload logic. If the files to upload are already on disk, it is recommended to leave this at the default, which will use aws s3 cp to upload them.
  • return_original_status_code – bool. Defaults to False. If True, then will return the original requests.Response.status_code of the initial POST, in addition to the usual dict response.
  • truncate_long_strings_in_payload_log – bool. Defaults to False. If True, then long strings (> 1000 characters) present in the payload will be truncated before being logged.
Returns:

The JSON response from the POST operation, or the existing record if it already exists on the Portal (where a GET on any of its aliases, when provided in the payload, finds the existing record). If return_original_status_code=True, then a tuple of the above dict and an int corresponding to the status code on POST of the initial payload will be returned.

Return type:

dict

Raises:
  • encode_utils.exceptions.AwardPropertyMissing – The award property isn’t present in the payload and there isn’t a default set by the environment variable DCC_AWARD.
  • encode_utils.exceptions.LabPropertyMissing – The lab property isn’t present in the payload and there isn’t a default set by the environment variable DCC_LAB.
  • encode_utils.exceptions.MissingAlias – The argument ‘require_aliases’ is set to True and the ‘aliases’ property is missing in the payload or is empty.
  • requests.exceptions.HTTPError – The return status is not ok.
Side effects:
self.PROFILE_KEY will be popped out of the payload if present, otherwise, the key “@id” will be popped out. Furthermore, self.ENCID_KEY will be popped out if present in the payload.
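
A minimal sketch of a POST (the profile, alias, and description are illustrative; lab and award are assumed to come from the DCC_LAB and DCC_AWARD environment variables):

payload = {
    Connection.PROFILE_KEY: "biosample",         # the profile to submit to
    "aliases": ["michael-snyder:my-biosample"],  # hypothetical alias
    "description": "A test biosample",
}
res = conn.post(payload)
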
patch(payload, raise_403=True, extend_array_values=True)[source]

PATCH a record on the Portal.

Before the PATCH is attempted, any pre-PATCH hooks are first called (see the method self.before_submit_hooks()). If the PATCH fails due to the resource not being found (404), then that fact is logged to both the debug and error loggers.

Parameters:
  • payload – dict. Contains the attribute key and value pairs to patch. Must contain the key self.ENCID_KEY in order to indicate which record to PATCH.
  • raise_403 – bool. True means to raise a requests.exceptions.HTTPError if a 403 status (forbidden) is returned. If set to False and there still is a 403 return status, then the object you were trying to PATCH will be fetched from the Portal in JSON format as this function’s return value.
  • extend_array_values – bool. Only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
Returns:

The JSON response from the PATCH operation, or an empty dict if the record doesn’t exist on the Portal. Will also be an empty dict if the dry-run feature is enabled.

Return type:

dict

Raises:
  • KeyError – The payload doesn’t have the key self.ENCID_KEY set AND there aren’t any aliases provided in the payload’s ‘aliases’ key.
  • requests.exceptions.HTTPError – The return status is not ok (excluding a 403 status if ‘raise_403’ is False).
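
For example (the accession and document alias are hypothetical):

payload = {
    Connection.ENCID_KEY: "ENCBS123ABC",          # the record to PATCH
    "documents": ["michael-snyder:my-document"],  # extends the existing array by default
}
res = conn.patch(payload, extend_array_values=True)
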
remove_props(rec_id, props=[])[source]

Runs a PUT request to remove properties of interest on the specified record.

Note that before-submit and after-submit hooks are not run here as they would be in self.patch() or self.post() (before_submit_hooks() and after_submit_hooks() are not called).

Parameters:
  • rec_id – str. An identifier for the record on the Portal.
  • props – list. The properties to remove from the record.

Returns: dict. Contains the JSON returned from the PUT request.
remove_and_patch(props, patch, raise_403=True, extend_array_values=True)[source]

Runs a PUT request to remove properties and patch a record in one request.

In general, this is a method combining remove_props and patch. It is useful because some schema dependencies require property removal and property patching (including adding new properties) to happen at the same time. Please note that after the record is retrieved from the Portal, props will be removed before the patch is applied.

Parameters:
  • props – list. The properties to remove from the record.
  • patch – dict. Contains the attribute key and value pairs to patch. Must contain the key self.ENCID_KEY in order to indicate which record to PATCH.
  • raise_403 – bool. True means to raise a requests.exceptions.HTTPError if a 403 status (forbidden) is returned. If set to False and there still is a 403 return status, then the object you were trying to PATCH will be fetched from the Portal in JSON format as this function’s return value.
  • extend_array_values – bool. Only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
Returns:

The JSON response from the PUT operation, or an empty dict if the record doesn’t exist on the Portal. Will also be an empty dict if the dry-run feature is enabled.

Return type:

dict

Raises:

requests.exceptions.HTTPError – if the return status is not ok (excluding a 403 status if ‘raise_403’ is False).
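
A sketch of removing one property while patching another in a single request (the accession and properties are hypothetical):

res = conn.remove_and_patch(
    props=["paired_end"],  # hypothetical property to remove
    patch={Connection.ENCID_KEY: "ENCFF123ABC", "run_type": "single-ended"},
)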

send(payload, error_if_not_found=False, extend_array_values=True, raise_403=True)[source]

Deprecated since version 1.1.1: Will be removed in the next major release.

A wrapper over self.post() and self.patch() that determines which to call based on whether the record exists on the Portal. Especially useful when submitting a high-level object, such as an experiment which contains many dependent objects, in which case you could have a mix where some need to be POST’d and some PATCH’d.

Parameters:
  • payload – dict. The data to submit.
  • error_if_not_found – bool. If set to True, then a PATCH will be attempted and a requests.exceptions.HTTPError will be raised if the record doesn’t exist on the Portal.
  • extend_array_values – bool. Only matters when doing a PATCH, and only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
  • raise_403 – bool. Only matters when doing a PATCH. True means to raise a requests.exceptions.HTTPError if a 403 status (forbidden) is returned. If set to False and there still is a 403 return status, then the object you were trying to PATCH will be fetched from the Portal in JSON format as this function’s return value (as handled by self.patch()).
Raises:

requests.exceptions.HTTPError – You want to do a PATCH (indicated by setting error_if_not_found=True) but the record isn’t found.

get_fastqfiles_on_exp(exp_id)[source]

Returns a list of all FASTQ file objects in the experiment.

Parameters: exp_id – str. An Experiment identifier.
Returns: Each element is the JSON form of a FASTQ file record.
Return type: list
get_fastqfile_replicate_hash(exp_id)[source]

Given a DCC experiment ID, gets its JSON representation from the Portal, looks in the original property to find FASTQ file objects, and creates a dict organized by replicate numbers. Keying into the dict by replicate numbers, you can get to a particular file object’s JSON serialization.

Parameters: exp_id – str. An Experiment identifier.
Returns: dict where each key is a biological_replicate_number. The value of each key is another dict, where each key is a technical_replicate_number. The value of this is yet another dict, with keys being file read numbers - 1 for forward reads, 2 for reverse reads. The value for a given key of this innermost dictionary is a list of JSON-serialized file objects.
Return type: dict
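
To illustrate the nesting (the experiment ID is hypothetical, and replicate numbers are assumed to be integers):

rep_hash = conn.get_fastqfile_replicate_hash("ENCSR999EHG")
# Biological replicate 1, technical replicate 1, forward reads (read number 1):
for fastq_json in rep_hash[1][1][1]:
    print(fastq_json["accession"])
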
extract_aws_upload_credentials(creds)[source]

Sets values for the AWS CLI security credentials (for uploading a file to AWS S3) to the credentials found in a file record’s upload_credentials property. The security credentials are stored in a dict where the keys are named after environment variables to be used by the AWS CLI.

Parameters: creds – dict. The value of a File object’s upload_credentials property.
Returns: dict containing keys named after AWS CLI environment variables, being:
  1. AWS_ACCESS_KEY_ID,
  2. AWS_SECRET_ACCESS_KEY,
  3. AWS_SECURITY_TOKEN,
  4. UPLOAD_URL

Will be empty if the upload_credentials property isn’t present in the provided creds.

Return type:dict
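
A sketch of exporting the returned credentials for the AWS CLI (the file accession is hypothetical; if upload_credentials is absent, see get_upload_credentials() below):

import os

file_json = conn.get("ENCFF123ABC")
creds = conn.extract_aws_upload_credentials(file_json["upload_credentials"])
os.environ["AWS_ACCESS_KEY_ID"] = creds["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = creds["AWS_SECRET_ACCESS_KEY"]
os.environ["AWS_SECURITY_TOKEN"] = creds["AWS_SECURITY_TOKEN"]
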
get_upload_credentials(file_id)[source]

Similar to self.extract_aws_upload_credentials(), but it goes a step further in that it is capable of regenerating the upload credentials if they aren’t currently present in the file record.

Parameters: file_id – str. A file object identifier (i.e. accession, uuid, alias, md5sum).
Returns: The value of the upload_credentials property if present; otherwise, the dict returned by self.regenerate_aws_upload_creds(), which tries to generate the value for this property.
Return type: dict
regenerate_aws_upload_creds(file_id)[source]

Reissues AWS S3 upload credentials for the specified file record.

Parameters: file_id – str. An identifier for a file record on the Portal.
Returns: dict containing the value of the ‘upload_credentials’ key in the JSON serialization of the file record represented by file_id. Will be empty if new upload credentials could not be issued.
Return type: dict
Raises: requests.exceptions.HTTPError – The response from the server isn’t a successful status code.
gcp_transfer_urllist(file_ids, filename)[source]

Creates a “URL list” file to be used by the Google Storage Transfer Service (STS); see documentation at https://cloud.google.com/storage-transfer/docs/create-url-list. Once the URL list is created, you need to upload it somewhere that Google STS can reach it via HTTP or HTTPS. I recommend uploading the URL list to your GCS bucket. From there, you can get an HTTPS URL for it by clicking on your file name (while in the GCP Console) and then copying the URL shown in your Web browser, which can in turn be pasted directly in the Google STS.

Parameters:
  • file_ids – list of file identifiers. The corresponding S3 objects must have public read permission as required for the URL list.
  • filename – str. The output filename in TSV format, which can be fed into the Google STS.
gcp_transfer_from_aws(file_ids, gcp_bucket, gcp_project, description='', aws_creds=())[source]

Copies one or more ENCODE files from AWS S3 storage to GCP storage by using the Google STS. This is similar to the gcp_transfer_urllist() method - the difference is that S3 object paths are copied directly instead of using public HTTPS URIs, and AWS keys are required here.

See encode_utils.transfer_to_gcp.Transfer() for full documentation.

Parameters:
  • file_ids – list. One or more ENCODE files to transfer. They can be any valid ENCODE File object identifier. Don’t mix ENCODE files from across buckets.
  • gcp_bucket – str. The name of the GCP bucket.
  • gcp_project – str. The GCP project that is associated with gcp_bucket. Can be given in either integer form or the user-friendly name form (i.e. sigma-night-206802).
  • description – str. The description to show when querying transfers via the Google Storage Transfer API, or via the GCP Console. May be left empty, in which case the default description will be the value of the first S3 file name to transfer.
  • aws_creds – tuple. Ideally, your AWS credentials will be stored in the environment. For additional flexibility though, you can specify them here as well in the form (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
Returns:

The JSON response representing the newly created transferJob.

Return type:

dict

upload_file(file_id, file_path=None, set_md5sum=False)[source]

Uploads a file to the Portal for the indicated file record. The file to upload can be specified by setting the file_path parameter, or by using the value of the ENCODE file profile’s submitted_file_name property of the given file object represented by the file_id parameter. The file to upload can be from any of the following sources:

  1. Path to a local file,
  2. S3 object, or
  3. Google Storage object

For the AWS option above, the user must set the proper AWS keys, see the wiki documentation.

If the dry-run feature is enabled, then this method will return prior to launching the upload command.

Parameters:
  • file_id – str. An identifier of a file record on the ENCODE Portal.
  • file_path – str. The local path to the file to upload, or an S3 object (i.e. s3://mybucket/test.txt), or a Google Storage object (i.e. gs://mybucket/test.txt). If not set, defaults to None, in which case the local file path will be extracted from the record’s submitted_file_name property.
  • set_md5sum – bool. True means to also calculate the md5sum and set the file record’s md5sum property on the Portal (this is currently only implemented for local files and S3, not yet GCP). This will always take place whenever the property isn’t yet set. Furthermore, setting this to True will also cause the file_size property to be set. Normally these two properties would already be set, as they are required in the file profile; however, if the wrong file was originally uploaded, then they must be reset when uploading a new file.
Raises:

encode_utils.exceptions.FileUploadFailed – The return code of the AWS upload command was non-zero.
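
For example (the file identifier and S3 path are hypothetical):

conn.upload_file(file_id="ENCFF123ABC")  # path taken from submitted_file_name
conn.upload_file(
    file_id="ENCFF123ABC",
    file_path="s3://mybucket/test.fastq.gz",
    set_md5sum=True,  # also (re)set md5sum and file_size on the record
)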

get_platforms_on_experiment(rec_id)[source]

Looks at all FASTQ files on the specified experiment, and tallies up the varying sequencing platforms that generated them. The platform of a given file record is indicated by the platform property. This is more or less used to verify that there isn’t a mix of multiple different platforms present, as normally all reads should come from the same platform.

Parameters: rec_id – str. DCC identifier for an experiment.
Returns: The de-duplicated list of platforms seen on the experiment’s FASTQ files.
Return type: list
post_document(document, document_type, description)[source]

POSTS a document to the Portal.

The alias for the document will be the lab prefix plus the file name. The lab prefix is taken as the value of the DCC_LAB environment variable, i.e. ‘michael-snyder’.

Parameters:
  • document_type – str. For possible values, see https://www.encodeproject.org/profiles/document.json. It appears that one should use “data QA” for analysis results documents.
  • description – str. The description for the document.
  • document – str. Local file path to the document to be submitted.
Returns:

The DCC UUID of the new document.

Return type:

str

download(rec_id, get_stream=False, directory=None)[source]

Downloads the contents of the specified file or document object from the ENCODE Portal to either the calling directory or the indicated download directory. The downloaded file will be named as it is on the Portal.

Alternatively, you can get a reference to the response object by setting the get_stream parameter to True. Useful if you want to inspect the response, i.e. see if there was a redirect and where to, or download the byte stream in a customized manner.

Parameters:
  • rec_id – str. A DCC identifier for a file or document record on the Portal.
  • directory – str. The full path to the directory in which to download the file. If not specified, then the file will be downloaded in the calling directory.
Returns:

str. The full path to the downloaded file if the get_stream parameter is False. requests.models.Response: the response object if the get_stream parameter is True.
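
For example (the identifier is hypothetical):

path = conn.download("ENCFF123ABC", directory="/tmp")  # returns the local file path
resp = conn.download("ENCFF123ABC", get_stream=True)   # returns requests.models.Response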

s3_object_path(rec_id, url=False)[source]

Given an ENCODE File object’s id (such as accession, uuid, alias), returns the full S3 object URI, or HTTP/HTTPS URI if url=True.

Parameters:
  • rec_id – str. An identifier of the File record on the Portal, i.e. accession, uuid, or alias.
  • url – bool. True means to return the HTTP/HTTPS URI of the file rather than the S3 URI. Useful if this is a released file since you can download via the URL.
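
For example (the accession is hypothetical):

s3_uri = conn.s3_object_path("ENCFF123ABC")               # the S3 URI
https_url = conn.s3_object_path("ENCFF123ABC", url=True)  # the HTTP/HTTPS URI
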
get_experiments_with_biosample(rec_id)[source]

Returns all experiments that have a link to the given biosample record. Technically, there should be at most one experiment linked to a given biosample, but it’s possible for additional experiments to be linked incorrectly, in which case audit flags will go off.

Parameters: rec_id – str. An identifier for a biosample record on the Portal.
Returns: list of dicts, where each dict is the JSON serialization of an experiment record that is linked to the provided biosample record. If no experiments are linked, then this will be an empty list.
get_biosample_type(classification, term_id=None, term_name=None)[source]

Searches the biosample_types for the given classification (i.e. tissue, cell line) and term_id or term_name. Both term_name and term_id need not be set - if both are, then term_id will take precedence. The combination of classification and term_id/term_name uniquely identifies a biosample_type.

Parameters:
  • classification – str. A value for the ‘classification’ property of the biosample_ontology profile.
  • term_id – str. A value for the ‘term_id’ property of the biosample_ontology profile.
  • term_name – str. A value for the ‘term_name’ property of the biosample_ontology profile.
Returns:

dict. Empty if no biosample_type is found, otherwise the JSON representation of the record.

Raises:
  • RecordNotFound – No search results.
  • Exception – More than one search result was returned. This should not happen and if it does then it’s likely a bug on the server side.