encode_utils.connection¶
-
encode_utils.connection.
LOG_DIR
= 'EU_Logs'¶ The directory that contains the log files created by the Connection class.
-
class
encode_utils.connection.
Connection
(dcc_mode=None, dry_run=False, submission=False, no_log_file=False)[source]¶ Bases:
object
Handles communication with the Portal regarding data submission and retrieval.
For data submission or modification, and working with non-released datasets, you must have the environment variables DCC_API_KEY and DCC_SECRET_KEY set. Check with your DCC data wrangler if you haven’t been assigned these keys.
There are three log files opened in append mode in the directory specified by
connection.LOG_DIR
that are specific to whichever Portal you are connected to. When connected to Production, each log file name will include the token ‘_prod_’; for Development, the token will be ‘_dev_’. The three log files are named in reference to their purpose:
- debug log file - All messages sent to STDOUT are also written to this log file. In addition, all messages written to the error log file described below are logged here.
- error log file - Only terse error messages are sent to this log file for quick scanning of any potential issues. If you identify an error here that needs more explanation, then you should consult the debug log file.
- posted log file - Tabulates what was successfully POSTED. There are three tab-delimited columns ordered as submission timestamp, record alias, and record accession (or UUID if the accession property doesn’t exist for the profile of the record at hand). Note that if a record has several aliases, then only the first one in the list for the aliases property is used.
-
ENCID_KEY
= '_enc_id'¶ Identifies the name of the key in the payload that stores a valid ENCODE-assigned identifier for a record, such as alias, accession, uuid, md5sum, … depending on the object being submitted. This is not a valid property of any ENCODE object schema, and is used in the
patch()
instance method to designate the record to update.
-
PROFILE_KEY
= '_profile'¶ Identifies the name of the key in the payload that stores the ID of the profile to submit to. Like
ENCID_KEY
, this is a non-schematic key that is used only internally.
-
POST
= 'post'¶ Constant
-
PATCH
= 'patch'¶ Constant
-
debug_logger
= None¶ A reference to the debug logging instance that was created earlier in
encode_utils.debug_logger
. This class adds a file handler, such that all messages sent to it are logged to this file in addition to STDOUT.
-
dcc_modes
= None¶ An indication of which Portal instance to use. Set to ‘prod’ for the production Portal, and ‘dev’ for the development Portal. Alternatively, you can set an explicit host, such as demo.encodedcc.org. Leaving the default of None means to use the value of the DCC_MODE environment variable.
-
dry_run
= None¶ Set to True to prevent any server-side changes on the ENCODE Portal, i.e. PUT, POST, PATCH, DELETE requests will not be sent to the Portal. After-POST and after-PATCH hooks (see the instance method
after_submit_hooks()
) will not be run either in this case. You can turn off this dry-run feature by calling the instance method set_live_run()
.
-
error_logger
= None¶ A
logging
instance with a file handler for logging terse error messages. The log file resides locally within the directory specified by the constant connection.LOG_DIR
. Accepts messages >= logging.ERROR
.
-
post_logger
= None¶ A
logging
instance with a file handler for logging successful POST operations. The log file resides locally within the directory specified by the constant connection.LOG_DIR
. Accepts messages >= logging.INFO
.
-
auth
¶ Sets the API and secret keys to use when authenticating with the DCC servers. These are determined from the values of the DCC_API_KEY and DCC_SECRET_KEY environment variables via the
_get_api_keys_from_env()
private instance method.
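As a rough sketch, the environment lookup behaves like the following. This is a hypothetical reimplementation for illustration only; the actual private method may differ in detail:

```python
import os

def get_api_keys_from_env():
    """Sketch of _get_api_keys_from_env(): read the DCC API credentials
    from the environment, returning None if either variable is unset."""
    api_key = os.environ.get("DCC_API_KEY")
    secret_key = os.environ.get("DCC_SECRET_KEY")
    if api_key and secret_key:
        return (api_key, secret_key)
    return None
```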
-
set_submission
(status)[source]¶ Sets the boolean value of the
self.submission
attribute. Parameters: status – bool.
-
check_dry_run
()[source]¶ Checks if the dry-run feature is enabled, and if so, logs the fact. This is mainly meant to be called by other methods that are designed to make modifications on the ENCODE Portal.
Returns: True if the dry-run feature is enabled, False otherwise. Return type: bool
-
add_alias_prefix
(aliases, prefix=False)[source]¶ Given a list of aliases, adds the lab prefix to each one that doesn’t yet have a prefix set. The lab prefix is taken as the passed-in prefix, otherwise, it defaults to the DCC_LAB environment variable, and it must be a value assigned by the DCC, i.e. “michael-snyder” for the Snyder Production Center. The DCC requires that aliases be prefixed in this manner.
Parameters: - aliases – list of aliases.
- prefix – str. The DCC assigned lab prefix to use. If not specified, then the default is the value of the DCC_LAB environment variable.
Returns: list.
Raises: Exception – A passed-in alias doesn’t have a prefix set, and the default prefix could not be determined.
Examples:
add_alias_prefix(aliases=["my-alias"], prefix="michael-snyder")
# Returns ["michael-snyder:my-alias"]
add_alias_prefix(aliases=["michael-snyder:my-alias"], prefix="michael-snyder")
# Returns ["michael-snyder:my-alias"]
add_alias_prefix(aliases=["my_alias"], prefix="bad-value")
# Raises an Exception since this lab prefix isn't from a registered source record on the Portal.
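The prefixing logic itself can be sketched as a self-contained function. This is an illustrative reimplementation; the real method also validates the prefix against the Portal's registered source records:

```python
import os

def add_alias_prefix(aliases, prefix=None):
    """Prefix each alias with the DCC lab prefix, e.g. 'michael-snyder:my-alias'.

    Simplified sketch: no validation against registered source records.
    """
    prefix = prefix or os.environ.get("DCC_LAB")
    result = []
    for alias in aliases:
        if ":" in alias:  # Already prefixed; leave as-is.
            result.append(alias)
        elif not prefix:
            raise Exception("Alias {} has no prefix and DCC_LAB is not set.".format(alias))
        else:
            result.append("{}:{}".format(prefix, alias))
    return result
```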
-
get_aliases
(dcc_id, strip_alias_prefix=False)[source]¶ Given an ENCODE identifier for an object, performs a GET request and extracts the aliases.
Parameters: - dcc_id – str. The ENCODE ID for a given object, i.e ENCSR999EHG.
- strip_alias_prefix – bool. True means to remove the alias prefix from all returned aliases.
Returns: The aliases.
Return type: list
-
indexing
()[source]¶ Indicates whether the Portal is updating its schema indices.
Returns: True if the Portal is indexing, False otherwise. Return type: bool
-
make_search_url
(search_args)[source]¶ Creates a URL encoded URL given the search arguments.
Parameters: search_args – list of two-item tuples of the form [(key, val), (key, val), ...]
. Returns: The URL containing the URL-encoded query. Return type: str
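A minimal sketch of the encoding step using the standard library. The base URL shown is an assumption for illustration; the real method uses the host of the connected Portal:

```python
from urllib.parse import urlencode

def make_search_url(search_args, base_url="https://www.encodeproject.org/search/"):
    """Build a URL-encoded search URL from [(key, val), ...] tuples."""
    return base_url + "?" + urlencode(search_args)
```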
-
search
(search_args=[], url=None, limit=None)[source]¶ Searches the Portal using the provided query parameters, which will first be URL encoded. The user can pass in the query parameters and values via the search_args argument, or pass in a URL directly that contains a query string via the url argument, or provide values for both arguments in which case the query parameters specified in search_args will be added to the query parameters given in the URL.
If
self.submission == True
, then the query will be searched with “datastore=database”, unless the ‘datastore’ query parameter is already set. Parameters: - search_args – list of two-item tuples of the form
[(key, val), (key, val) ,...]
. To support a != style query, append “!” to the key name. - url – str. A URL used to search for records interactively in the ENCODE Portal. The query will be extracted from the URL.
- limit – int. The number of search results to send from the server. The default means to return all results.
Returns: The search results.
Return type: list
Raises: requests.exceptions.HTTPError – The status code is not ok and != 404.
-
get_profile_from_payload
(payload)[source]¶ Useful to call when doing a POST (and
self.post()
does call this). Ensures that the profile key identified by self.PROFILE_KEY
exists in the passed-in payload and that the value is a recognized ENCODE object profile (schema) identifier. Alternatively, the user can set the profile in the more convoluted @id property.Parameters: payload – dict. The intended object data to POST.
Returns: The ID of the profile if all validations pass.
Return type: str
Raises: encode_utils.exceptions.ProfileNotSpecified
– Both keys self.PROFILE_KEY
and @id are missing in the payload.encode_utils.profiles.UnknownProfile
– The profile ID isn’t recognized by the class encode_utils.profiles.Profile.
-
get_lookup_ids_from_payload
(payload)[source]¶ Given a payload to submit to the Portal, extracts the identifiers that can be used to lookup the record on the Portal, i.e. to see if the record already exists. Identifiers are extracted from the following fields:
- self.ENCID_KEY,
- aliases,
- md5sum (in the case of a file object)
Parameters: payload – dict. The data to submit. Returns: The possible lookup identifiers. Return type: list
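The extraction logic can be sketched as follows. This is illustrative; the real method may normalize or de-duplicate the identifiers:

```python
ENCID_KEY = "_enc_id"  # Mirrors Connection.ENCID_KEY

def get_lookup_ids_from_payload(payload):
    """Collect identifiers usable to look up a record on the Portal (sketch)."""
    lookup_ids = []
    if ENCID_KEY in payload:
        lookup_ids.append(payload[ENCID_KEY])
    lookup_ids.extend(payload.get("aliases", []))
    if "md5sum" in payload:  # Only file objects carry an md5sum.
        lookup_ids.append(payload["md5sum"])
    return lookup_ids
```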
-
get
(rec_ids, database=False, ignore404=True, frame=None)[source]¶ GET a record from the Portal.
Looks up a record in the Portal and performs a GET request, returning the JSON serialization of the object. You supply a list of identifiers for a specific record, and the Portal will be searched for each identifier in turn until one is either found or the list is exhausted.
Parameters: - rec_ids – str or list. Must be a list if you want to supply more than one identifier. For a few example identifiers, you can use a uuid, accession, …, or even the value of a record’s @id property.
- database – bool. If True, then search the database directly instead of the Elasticsearch indices. Always True when in submission mode (self.submission is True).
- frame – str. A value for the frame query parameter, i.e. ‘object’, ‘edit’. See https://www.encodeproject.org/help/rest-api/ for details.
- ignore404 – bool. Only matters when none of the passed-in record IDs were found on the Portal. In this case, if set to True, then an empty dict will be returned. If set to False, then an Exception will be raised.
Returns: The JSON response. Will be empty if no record was found AND
ignore404=True
. Return type: dict
Raises: - Exception – If the server responds with a FORBIDDEN status.
- requests.exceptions.HTTPError – The status code is not ok, and the
cause isn’t due to a 404 (not found) status code when
ignore404=True
.
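The identifier fall-through behavior can be sketched as follows, where fetch_json is a hypothetical callable standing in for the actual HTTP GET against the Portal:

```python
def get_record(rec_ids, fetch_json, ignore404=True):
    """Try each identifier in turn until one is found (sketch).

    fetch_json is a hypothetical callable returning the record's JSON,
    or None when the Portal responds with a 404.
    """
    if isinstance(rec_ids, str):
        rec_ids = [rec_ids]
    for rec_id in rec_ids:
        result = fetch_json(rec_id)
        if result is not None:
            return result
    if ignore404:
        return {}
    raise Exception("None of the identifiers {} were found.".format(rec_ids))
```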
-
set_attachment
(document)[source]¶ Sets the attachment property for any profile that supports it, such as document or antibody_characterization.
Checks if the provided file is an image in either the JPEG or TIFF format - if so, then checks the image orientation in the EXIF data and rotates it if necessary. The original image will not be modified.
Parameters: document – str. A local file path. Returns: The ‘attachment’ property value. Return type: dict
-
after_submit_file_cloud_upload
(rec_id, profile_id)[source]¶ An after-POST submit hook for uploading files to AWS.
Some objects, such as Files (file.json profile), need to have a corresponding file in the cloud. Where in the cloud the actual file should be uploaded is indicated in the File object’s file.upload_credentials.upload_url property. Once the File object is posted, this hook is used to perform the actual cloud upload of the physical, local file represented by the File object.
Parameters: - rec_id – str. An identifier for the new File object on the Portal.
- profile_id – str. The ID of the profile that the record belongs to.
-
after_submit_hooks
(rec_id, profile_id, method='', upload_file=True)[source]¶ Calls after-POST and after-PATCH hooks. This method is called from both the
post()
and patch()
instance methods. Returns None immediately if the dry-run feature is enabled. Some hooks only run if you are doing a PATCH, others only if you are doing a POST, and some run for either operation. Each hook that is called can potentially modify the payload.
Parameters: - rec_id – str. An identifier for the record on the Portal.
- profile_id – str. The profile identifier indicating the profile that the record belongs to.
- method – str. One of
self.POST
orself.PATCH
, or the empty string to indicate which registered hooks to look through. - upload_file – bool. If False, skip uploading files to the Portal. Defaults to True.
-
before_submit_alias
(payload)[source]¶ A pre-POST and pre-PATCH hook used to:
- Clean alias names by removing disallowed characters indicated by the DCC schema for the alias property.
- Add the lab alias prefix to any aliases that are missing it. The DCC_LAB environment variable is consulted to fetch the lab name, and if not set then this will be a no-op.
Parameters: payload – dict. The payload to submit to the Portal. Returns: The potentially modified payload. Return type: dict
-
before_submit_attachment
(payload)[source]¶ A pre-POST and pre-PATCH hook used to simplify the creation of an attachment in profiles that support it.
Checks the payload for the presence of the attachment property that is used by certain profiles, i.e. document and antibody_characterization, and then checks to see if a particular shortcut is being employed to indicate the attachment. That shortcut works as follows: if the dictionary value of the ‘attachment’ key has a key named ‘path’ in it (case-sensitive), then the value is taken to be the path to a local file. Then, the actual attachment object is constructed, as defined in the document profile, by calling
self.set_attachment()
. Note that this shortcut is particular to this Connection
class, and when used the ‘path’ key should be the only key in the attachment dictionary as any others will be ignored. Parameters: payload – dict. The payload to submit to the Portal. Returns: The potentially modified payload. Return type: dict
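For illustration, a payload using the shortcut might look like this. All values shown (alias, file path) are hypothetical:

```python
# Shortcut form: only the 'path' key is needed; the hook expands it into the
# full attachment object (href, type, md5sum, ...) via set_attachment().
payload = {
    "_profile": "document",                        # Connection.PROFILE_KEY
    "aliases": ["michael-snyder:my-doc"],          # hypothetical alias
    "document_type": "data QA",
    "attachment": {"path": "/path/to/spec.pdf"},   # hypothetical local file
}
```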
-
before_post_file
(payload)[source]¶ A pre-POST hook that calculates and sets the md5sum and file_size properties for a file record. However, if either of these properties is already set to a non-empty value in the payload, it will not be overwritten.
Parameters: payload – dict. The payload to submit to the Portal. Returns: The potentially modified payload. Return type: dict Raises: encode_utils.utils.MD5SumError
– Percolated up from the function encode_utils.utils.calculate_md5sum when it can’t calculate the md5sum.
-
before_post_fastq_file
(payload)[source]¶ A pre-POST hook for FASTQ file objects that checks whether certain rules are followed as defined in the file.json schema.
For example, if the FASTQ file is sequenced single-end, then the property
File.run_type
should be set to single-ended as expected; however, the property File.paired_end
shouldn’t be set in the payload, as the File.run_type
property has the comment: Only paired-ended files should have paired_end values
-
before_submit_hooks
(payload, method='')[source]¶ Calls pre-POST and pre-PATCH hooks. This method is called from both the
post()
and patch()
instance methods. Some hooks only run if you are doing a PATCH, others only if you are doing a POST, and some run for either operation. Each hook that is called can potentially modify the payload.
Parameters: - payload – dict. The payload to POST or PATCH.
- method – str. One of “post” or “patch”, or the empty string to indicate which registered hooks to call. Some hooks are agnostic to the HTTP method, and these hooks are always called. Setting method to the empty string means to only call these agnostic hooks.
Returns: The potentially modified payload that has been passed through all applicable pre-submit hooks.
Return type: dict
-
post
(payload, require_aliases=True, upload_file=True, return_original_status_code=False, truncate_long_strings_in_payload_log=False)[source]¶ POST a record to the Portal.
Requires that you include in the payload the non-schematic key
self.PROFILE_KEY
to designate the name of the ENCODE object profile that you are submitting to, or the actual @id property itself. If the lab property isn’t present in the payload, then the default will be set to the value of the DCC_LAB environment variable. Similarly, if the award property isn’t present, then the default will be set to the value of the DCC_AWARD environment variable.
Before the POST is attempted, any pre-POST hooks are first called; see the method
self.before_submit_hooks
). After a successful POST, any after-POST submit hooks are also run; see the method self.after_submit_hooks
. Parameters: - payload – dict. The data to submit.
- require_aliases – bool. True means that the ‘aliases’ property is required in the payload. This is the default, and it is highly recommended not to change it, because without an alias it is easy to create duplicates on the server by accidentally POSTing the same payload again. For example, you could create the same biosample as many times as you want on the Portal when not providing an alias. Furthermore, submitting labs should include at least one alias per record being submitted to the Portal for traceability purposes in the submitting lab.
- upload_file – bool. If False, when POSTing files the file data will not be uploaded to S3; defaults to True. This can be useful if you have custom upload logic. If the files to upload are already on disk, it is recommended to leave this at the default, which will use aws s3 cp to upload them.
- return_original_status_code – bool. Defaults to False. If True, then will return the original requests.Response.status_code of the initial post, in addition to the usual dict response.
- truncate_long_strings_in_payload_log – bool. Defaults to False. If True, then long strings (> 1000 characters) present in the payload will be truncated before being logged.
Returns: The JSON response from the POST operation, or the existing record if it already exists on the Portal (where a GET on any of its aliases, when provided in the payload, finds the existing record). If return_original_status_code=True, then a tuple of the above dict and an int corresponding to the status code on POST of the initial payload will be returned.
Return type: dict
Raises: encode_utils.exceptions.AwardPropertyMissing
– The award property isn’t present in the payload and there isn’t a default set by the environment variable DCC_AWARD.encode_utils.exceptions.LabPropertyMissing
– The lab property isn’t present in the payload and there isn’t a default set by the environment variable DCC_LAB.encode_utils.exceptions.MissingAlias
– The argument ‘require_aliases’ is set to True and the ‘aliases’ property is missing in the payload or is empty.requests.exceptions.HTTPError
– The return status is not ok.
- Side effects:
- self.PROFILE_KEY will be popped out of the payload if present, otherwise, the key “@id” will be popped out. Furthermore, self.ENCID_KEY will be popped out if present in the payload.
-
patch
(payload, raise_403=True, extend_array_values=True)[source]¶ PATCH a record on the Portal.
Before the PATCH is attempted, any pre-PATCH hooks are first called (see the method
self.before_submit_hooks()
). If the PATCH fails due to the resource not being found (404), then that fact is logged to both the debug and error loggers. Parameters: - payload – dict containing the attribute key and value pairs to patch. Must contain the key
self.ENCID_KEY
in order to indicate which record to PATCH. - raise_403 – bool. True means to raise a
requests.exceptions.HTTPError
if a 403 status (forbidden) is returned. If set to False and there still is a 403 return status, then the object you were trying to PATCH will be fetched from the Portal in JSON format as this function’s return value. - extend_array_values – bool. Only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
Returns: The JSON response from the PATCH operation, or an empty dict if the record doesn’t
exist on the Portal. Will also be an empty dict if the dry-run feature is enabled.
Return type: dict
Raises: KeyError
– The payload doesn’t have the key self.ENCID_KEY
set AND there aren’t any aliases provided in the payload’s ‘aliases’ key.requests.exceptions.HTTPError
– if the return status is not ok (excluding a 403 status if ‘raise_403’ is False).
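The extend_array_values behavior can be sketched as follows. This is illustrative; in particular, whether the real method de-duplicates extended arrays is an assumption here:

```python
def merge_payload(existing, payload, extend_array_values=True):
    """Sketch of how PATCH combines the payload with the record on the Portal.

    With extend_array_values=True, array values are extended (skipping items
    already present); otherwise the payload value replaces the Portal's value.
    """
    merged = dict(existing)
    for key, value in payload.items():
        if extend_array_values and isinstance(value, list) and isinstance(merged.get(key), list):
            for item in value:
                if item not in merged[key]:
                    merged[key] = merged[key] + [item]  # New list; don't mutate the original.
        else:
            merged[key] = value
    return merged
```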
-
remove_props
(rec_id, props=[])[source]¶ Runs a PUT request to remove properties of interest on the specified record.
Note that before-submit and after-submit hooks are not run here as they would be in self.patch() or self.post() (
before_submit_hooks()
andafter_submit_hooks()
are not called). Parameters: - rec_id – str. An identifier for the record on the Portal.
- props – list. The properties to remove from the record.
Returns: dict. Contains the JSON returned from the PUT request.
-
remove_and_patch
(props, patch, raise_403=True, extend_array_values=True)[source]¶ Runs a PUT request to remove properties and patch a record in one request.
In general, this is a method combining
remove_props
and patch
. This is useful because some schema dependencies require that property removal and property patching (including adding new properties) happen at the same time. Please note that after the record is retrieved from the Portal, props
will be removed before the patch
is applied. Parameters: - props – list. The properties to remove from the record.
- patch – dict. containing the attribute key and value pairs to
patch. Must contain the key
self.ENCID_KEY
in order to indicate which record to PATCH. - raise_403 – bool. True means to raise a
requests.exceptions.HTTPError
if a 403 status (forbidden) is returned. If set to False and there still is a 403 return status, then the object you were trying to PATCH will be fetched from the Portal in JSON format as this function’s return value. - extend_array_values – bool. Only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
Returns: The JSON response from the PUT operation, or an empty dict
if the record doesn’t exist on the Portal. Will also be an empty dict if the dry-run feature is enabled.
Return type: dict
Raises: requests.exceptions.HTTPError
– if the return status is not ok (excluding a 403 status if ‘raise_403’ is False).
-
send
(payload, error_if_not_found=False, extend_array_values=True, raise_403=True)[source]¶ Deprecated since version 1.1.1: Will be removed in the next major release.
A wrapper over
self.post()
and self.patch()
that determines which to call based on whether the record exists on the Portal. Especially useful when submitting a high-level object, such as an experiment which contains many dependent objects, in which case you could have a mix where some need to be POST’d and some PATCH’d. Parameters: - payload – dict. The data to submit.
- error_if_not_found – bool. If set to True, then a PATCH will be attempted and a
requests.exceptions.HTTPError
will be raised if the record doesn’t exist on the Portal. - extend_array_values – bool. Only matters when doing a PATCH, and only affects keys with array values. True (default) means to extend the corresponding value on the Portal with what’s specified in the payload. False means to replace the value on the Portal with what’s in the payload.
- raise_403 – bool. Only matters when doing a PATCH. True means to raise a
requests.exceptions.HTTPError if a 403 status (forbidden) is returned.
If set to False and there still is a 403 return status, then the object you were
trying to PATCH will be fetched from the Portal in JSON format as this function’s
return value (as handled by
self.patch()
).
Raises: requests.exceptions.HTTPError
– You want to do a PATCH (indicated by setting error_if_not_found=True
) but the record isn’t found.
-
get_fastqfiles_on_exp
(exp_id)[source]¶ Returns a list of all FASTQ file objects in the experiment.
Parameters: exp_id – str. An Experiment identifier. Returns: Each element is the JSON form of a FASTQ file record. Return type: list
-
get_fastqfile_replicate_hash
(exp_id)[source]¶ Given a DCC experiment ID, gets its JSON representation from the Portal and looks in the original_files property to find FASTQ file objects, then creates a dict organized by replicate numbers. Keying through the dict by replicate numbers, you can get to a particular file object’s JSON serialization.
Parameters: exp_id – str. An Experiment identifier. Returns: dict where each key is a biological_replicate_number. The value of each key is another dict where each key is a technical_replicate_number. The value of this is yet another dict with keys being file read numbers - 1 for forward reads, 2 for reverse reads. The value for a given key of this most inner dictionary is a list of JSON-serialized file objects. Return type: dict
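The shape of the returned dict can be sketched as follows. This is illustrative; the field names and the assumption that paired_end is available as an integer read number are simplifications:

```python
def build_replicate_hash(fastq_files):
    """Organize FASTQ file records into rep[bio_rep][tech_rep][read_number] -> [files].

    Each input record is assumed to carry a 'replicate' sub-dict and a
    'paired_end'-style read number; the real method reads these from the
    Portal's JSON serialization.
    """
    rep_hash = {}
    for f in fastq_files:
        bio = f["replicate"]["biological_replicate_number"]
        tech = f["replicate"]["technical_replicate_number"]
        read = f.get("paired_end", 1)  # 1 = forward reads, 2 = reverse reads
        rep_hash.setdefault(bio, {}).setdefault(tech, {}).setdefault(read, []).append(f)
    return rep_hash
```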
-
extract_aws_upload_credentials
(creds)[source]¶ Sets values for the AWS CLI security credentials (for uploading a file to AWS S3) to the credentials found in a file record’s upload_credentials property. The security credentials are stored in a dict where the keys are named after environment variables to be used by the AWS CLI.
Parameters: creds – dict. The value of a File object’s upload_credentials property. Returns: dict containing keys named after AWS CLI environment variables being: - AWS_ACCESS_KEY_ID,
- AWS_SECRET_ACCESS_KEY,
- AWS_SECURITY_TOKEN,
- UPLOAD_URL
Will be empty if the upload_credentials property isn’t present in the file record.
Return type: dict
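A sketch of the mapping follows. The source key names (access_key, secret_key, session_token, upload_url) are assumptions about the upload_credentials shape, not confirmed by this documentation:

```python
def extract_aws_upload_credentials(creds):
    """Map a File record's upload_credentials to AWS CLI environment
    variable names (sketch; source key names are assumed)."""
    if not creds:
        return {}
    return {
        "AWS_ACCESS_KEY_ID": creds["access_key"],
        "AWS_SECRET_ACCESS_KEY": creds["secret_key"],
        "AWS_SECURITY_TOKEN": creds["session_token"],
        "UPLOAD_URL": creds["upload_url"],
    }
```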
-
get_upload_credentials
(file_id)[source]¶ Similar to
self.extract_aws_upload_credentials()
, but it goes a step further in that it is capable of regenerating the upload credentials if they aren’t currently present in the file record. Parameters: file_id – str. A file object identifier (i.e. accession, uuid, alias, md5sum). Returns: The value of the upload_credentials property if present, otherwise the dict
returned by
self.regenerate_aws_upload_creds
, which tries to generate the value for this property.
Return type: dict
-
regenerate_aws_upload_creds
(file_id)[source]¶ Reissues AWS S3 upload credentials for the specified file record.
Parameters: file_id – str. An identifier for a file record on the Portal. Returns: dict containing the value of the ‘upload_credentials’ key in the JSON serialization of the file record represented by file_id. Will be empty if new upload credentials could not be issued. Return type: dict Raises: requests.exceptions.HTTPError – The response from the server isn’t a successful status code.
-
gcp_transfer_urllist
(file_ids, filename)[source]¶ Creates a “URL list” file to be used by the Google Storage Transfer Service (STS); see documentation at https://cloud.google.com/storage-transfer/docs/create-url-list. Once the URL list is created, you need to upload it somewhere that Google STS can reach it via HTTP or HTTPS. I recommend uploading the URL list to your GCS bucket. From there, you can get an HTTPS URL for it by clicking on your file name (while in the GCP Console) and then copying the URL shown in your Web browser, which can in turn be pasted directly in the Google STS.
Parameters: - file_ids – list of file identifiers. The corresponding S3 objects must have public read permission as required for the URL list.
- filename – str. The output filename in TSV format, which can be fed into the Google STS.
-
gcp_transfer_from_aws
(file_ids, gcp_bucket, gcp_project, description='', aws_creds=())[source]¶ Copies one or more ENCODE files from AWS S3 storage to GCP storage by using the Google STS. This is similar to the
gcp_transfer_urllist()
method - the difference is that S3 object paths are copied directly instead of using public HTTPS URIs, and AWS keys are required here.See
encode_utils.transfer_to_gcp.Transfer()
for full documentation. Parameters: - file_ids – list. One or more ENCODE files to transfer. They can be any valid ENCODE File object identifier. Don’t mix ENCODE files from across buckets.
- gcp_bucket – str. The name of the GCP bucket.
- gcp_project – str. The GCP project that is associated with gcp_bucket. Can be given in either integer form or the user-friendly name form (i.e. sigma-night-206802)
- description – str. The description to show when querying transfers via the Google Storage Transfer API, or via the GCP Console. May be left empty, in which case the default description will be the value of the first S3 file name to transfer.
- aws_creds – tuple. Ideally, your AWS credentials will be stored in the environment.
For additional flexibility though, you can specify them here as well in the form
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
.
Returns: The JSON response representing the newly created transferJob.
Return type: dict
-
upload_file
(file_id, file_path=None, set_md5sum=False)[source]¶ Uploads a file to the Portal for the indicated file record. The file to upload can be specified by setting the file_path parameter, or by using the value of the ENCODE file profile’s submitted_file_name property of the given file object represented by the file_id parameter. The file to upload can be from any of the following sources:
- Path to a local file,
- S3 object, or
- Google Storage object
For the AWS option above, the user must set the proper AWS keys, see the wiki documentation.
If the dry-run feature is enabled, then this method will return prior to launching the upload command.
Parameters: - file_id – str. An identifier of a file record on the ENCODE Portal.
- file_path – str. The local path to the file to upload, or an S3 object (i.e s3://mybucket/test.txt), or a Google Storage object (i.e. gs://mybucket/test.txt). If not set, defaults to None in which case the local file path will be extracted from the record’s submitted_file_name property.
- set_md5sum – bool. True means to also calculate the md5sum and set the file record’s md5sum property on the Portal (this currently is only implemented for local files and S3; not yet GCP). This will always take place whenever the property isn’t yet set. Furthermore, setting to True will also cause the file_size property to be set. Normally these two properties would already be set as they are required in the file profile, however, if the wrong file was originally uploaded, then they must be reset when uploading a new file.
Raises: encode_utils.exceptions.FileUploadFailed
– The return code of the AWS upload command was non-zero.
-
get_platforms_on_experiment
(rec_id)[source]¶ Looks at all FASTQ files on the specified experiment and tallies up the various sequencing platforms that generated them. The platform of a given file record is indicated by the platform property. This is more or less used to verify that there isn’t a mix of multiple different platforms present, as normally all reads should come from the same platform.
Parameters: rec_id – str. DCC identifier for an experiment. Returns: The de-duplicated list of platforms seen on the experiment’s FASTQ files. Return type: list
-
post_document
(document, document_type, description)[source]¶ POSTS a document to the Portal.
The alias for the document will be the lab prefix plus the file name. The lab prefix is taken as the value of the DCC_LAB environment variable, i.e. ‘michael-snyder’.
Parameters: - document_type – str. For possible values, see https://www.encodeproject.org/profiles/document.json. It appears that one should use “data QA” for analysis results documents.
- description – str. The description for the document.
- document – str. Local file path to the document to be submitted.
Returns: The DCC UUID of the new document.
Return type: str
-
download
(rec_id, get_stream=False, directory=None)[source]¶ Downloads the contents of the specified file or document object from the ENCODE Portal to either the calling directory or the indicated download directory. The downloaded file will be named as it is on the Portal.
Alternatively, you can get a reference to the response object by setting the get_stream parameter to True. This is useful if you want to inspect the response, i.e. see if there was a redirect and where to, or download the byte stream in a customized manner.
Parameters: - rec_id – str. A DCC identifier for a file or document record on the Portal.
- directory – str. The full path to the directory in which to download the file. If not specified, then the file will be downloaded in the calling directory.
Returns: str. The full path to the downloaded file when the get_stream parameter is False. requests.models.Response: The response object when the get_stream parameter is True.
-
s3_object_path
(rec_id, url=False)[source]¶ Given an ENCODE File object’s id (such as accession, uuid, alias), returns the full S3 object URI, or HTTP/HTTPS URI if url=True.
Parameters: - rec_id – str. A DCC object identifier of the record to link the document to.
- url – bool. True means to return the HTTP/HTTPS URI of the file rather than the S3 URI. Useful if this is a released file since you can download via the URL.
-
get_experiments_with_biosample
(rec_id)[source]¶ Returns all experiments that have a link to the given biosample record. Technically, there should be at most one experiment linked to a given biosample, but it’s possible for additional experiments to be linked incorrectly, in which case audit flags will go off.
Parameters: rec_id – str. An identifier for a biosample record on the Portal. Returns: list of dicts, where each dict is the JSON serialization of an experiment record that is
linked to the provided biosample record. If no experiments are linked, then this will be an empty list.
-
get_biosample_type
(classification, term_id=None, term_name=None)[source]¶ Searches the biosample_types for the given classification (i.e. tissue, cell line) and term_id or term_name. Not both term_id and term_name need to be set; if both are, then term_id takes precedence. The combination of classification and term_id/term_name uniquely identifies a biosample_type.
Parameters: - classification – str. A value for the ‘classificaiton’ property of the biosample_ontology profile.
- term_id – str. A value for the ‘term_id’ property of the biosample_ontology profile.
- term_name – str. A value for the ‘term_name’ property of the biosample_ontology profile.
Returns: dict. Empty if no biosample_type was found, otherwise the JSON representation of the record.
Raises: - RecordNotFound – No search results.
- Exception – More than one search result was returned. This should not happen and if it does then it’s likely a bug on the server side.